Changelog • rurl

rurl 1.4.0

Accessor improvements

get_path() gains path_normalization, index_page_handling, trailing_slash_handling, and path_encoding arguments, matching the corresponding options of safe_parse_url().
get_scheme() gains scheme_relative_handling.
get_parse_status() gains source (mapped to tld_source) so warning statuses can be queried under a specific PSL section.
get_clean_url() and get_host() gain source (mapped to tld_source).
get_host() gains host_encoding.
get_domain(), get_tld(), and get_subdomain() gain host_encoding, mirroring get_host().

All new arguments default to the same values as safe_parse_url(), so existing calls are unaffected.

Behavior change

The domain-family accessors (get_domain(), get_tld(), get_subdomain()) now follow host_encoding (default "keep") instead of always returning Unicode. Under "keep" the emitted domain/TLD/subdomain mirrors the input host’s own spelling: an A-label (xn--…) host yields A-label parts, a Unicode host yields Unicode parts. Pass host_encoding = "unicode" for the previous always-decoded output, or "idna" to force A-labels. This makes the domain accessors consistent with get_host(), whose host_encoding already defaulted to "keep".

Internal

Parse-status string literals are now replaced by named constants (R/status-constants.R) and predicates (.is_ok_status(), .is_warning_status(), .is_joinable_status()).
Cache touchpoints in R/zzz.R are now driven from a single .CACHE_REGISTRY instead of repeating cache names by hand.
Cleared the lintr/goodpractice findings across R/ and the tests (e.g. fixed = TRUE dot splits, condition-message construction, dropped unnecessary lambdas) with no behavior change.
.lintr now mirrors goodpractice’s linter set, so a local lintr::lint_package() matches the goodpractice report; intentional test-idiom deviations are documented in the config header.
Restored 100% line coverage: added targeted tests for the .punycode_to_unicode(""), .host_is_ace(), and .cache_enabled() guard branches and the derive_parse_status() NA-host-dot fallback (and fixed an over-escaped regex literal that left the scheme-slash NA guard untested). The two genuinely unreachable www-prefix regex-capture fallbacks are now marked # nocov with justification.
Reduced the cyclomatic complexity of canonical_join() (47→7), get_subdomain() (26→6), rurl_cache_config() (23→5), and safe_parse_urls() (19→3) by extracting named sub-helpers (e.g. .cj_validate_inputs()/.cj_resolve_sides()/.cj_build_join_df(), .subdomain_labels(), .validate_max_full_parse(), .spu_coerce_original()). No behavior change; no function in the package now exceeds the goodpractice cyclocomp threshold of 15.

rurl 1.3.0

Dependencies

Public Suffix List matching is now delegated to the pslr package (Imports: pslr (>= 1.0.1)). rurl no longer ships its own processed copy of the list (R/sysdata.rda) or its embedded matcher, and data-raw/update_psl.R has been removed. punycoder is now required at >= 1.1.0.

Behavior changes (PSL correctness)

The embedded matcher used through 1.2.0 was not fully spec-correct. Delegating to pslr fixes the following; outputs change accordingly:

Wildcard rules (*.) are now honored by TLD extraction. For example get_tld("a.b.kobe.jp") is now "b.kobe.jp" (was "kobe.jp").
Exception rules (!) are now honored by TLD extraction. For example get_tld("www.ck") is now "ck" (was "www.ck"), and get_tld("foo.ck") is now "foo.ck" (was "ck").
IDN hosts now resolve a registered domain in every section. For example get_domain("example.рф") is now "example.рф" (was NA).
safe_parse_url() / safe_parse_urls() now derive the domain field using the requested tld_source rather than always using the combined list, so domain and tld are consistent within a parse. Under tld_source = "private" (or "icann"), a host with no suffix in that section now has domain = NA; consequently subdomain_levels_to_keep is a no-op for such hosts (there is no registered domain to trim toward). The default tld_source = "all" is unaffected.
Hosts under an unknown TLD continue to return NA for both domain and TLD (rurl queries pslr with unknown = "na"), rather than treating an unknown single label as a public suffix.
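
The wildcard and exception semantics described above can be illustrated with a small base-R sketch. It is not the pslr implementation: toy_rules is a hand-picked handful of rules and toy_public_suffix() is a hypothetical helper written only for this illustration. It applies the usual Public Suffix List precedence (an exception rule wins and drops its leftmost label; otherwise the longest matching rule wins) and returns NA when no rule matches, mirroring the unknown = "na" behavior noted above.

```r
# Toy rule set (assumption: a tiny excerpt in PSL syntax, not the real list).
toy_rules <- c("jp", "*.kobe.jp", "!city.kobe.jp", "*.ck", "!www.ck", "com")

# Hypothetical helper: return the public suffix of `host` under `rules`,
# or NA_character_ when no rule matches.
toy_public_suffix <- function(host, rules = toy_rules) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  best <- NA_character_
  best_len <- 0L
  for (rule in rules) {
    is_exception <- startsWith(rule, "!")
    rule_labels <- strsplit(sub("^!", "", rule), ".", fixed = TRUE)[[1]]
    k <- length(rule_labels)
    if (k > n) next
    host_tail <- labels[(n - k + 1L):n]
    # A "*" label matches any single host label.
    if (!all(rule_labels == "*" | rule_labels == host_tail)) next
    if (is_exception) {
      # Exception rules prevail: the suffix is the rule minus its leftmost label.
      return(paste(host_tail[-1L], collapse = "."))
    }
    if (k > best_len) {
      best <- paste(host_tail, collapse = ".")
      best_len <- k
    }
  }
  best
}

toy_public_suffix("a.b.kobe.jp")   # wildcard rule *.kobe.jp applies
toy_public_suffix("www.ck")        # exception rule !www.ck applies
toy_public_suffix("foo.ck")        # wildcard rule *.ck applies
toy_public_suffix("host.invalid")  # no rule matches
```

With this toy rule set the first three calls reproduce the values quoted in the entries above ("b.kobe.jp", "ck", "foo.ck"), and the last returns NA.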

Cache changes

The per-host domain and tld memoization caches have been removed; pslr caches its own query results. rurl_cache_config() and rurl_cache_info() now cover only full_parse, puny_encode, and puny_decode, and the domain / tld arguments to rurl_cache_config() no longer exist.

rurl 1.2.0

CRAN release: 2026-06-19

Dependencies

punycoder (used for IDNA/Punycode encoding and decoding) is now on CRAN. DESCRIPTION requires punycoder (>= 1.0.0).

Behavior changes

The package-wide default for case_handling is now "lower_host" (was "keep" for safe_parse_url(), safe_parse_urls(), get_clean_url(), and the get_*() accessors, and "lower" for get_path()). This is the RFC 3986 §6.2.2.1 normalization: the case-insensitive scheme and host fold to lowercase while the case-sensitive path is preserved. With the previous defaults, hosts such as WWW.Example.COM and www.example.com did not fold to one identity, and get_path() silently lowercased paths (two pages that differ only by path casing collapsed to one). Pass case_handling = "keep" to restore the previous reconstruction, or "lower" to lowercase the whole URL including the path. (RURL-lzepdnmm)
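
As a rough illustration of what "lower_host" normalization means, the following base-R sketch lowercases the scheme and authority and leaves everything from the path onward untouched. lower_host_sketch() is a hypothetical helper, not rurl's code; it assumes an absolute URL of the form scheme://authority/... with no userinfo (userinfo is case-sensitive and would need separate handling).

```r
# Hypothetical helper: fold scheme and host to lowercase, keep path/query/fragment.
# Assumption: input looks like "scheme://authority<rest>" and has no userinfo.
lower_host_sketch <- function(url) {
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*://)([^/?#]*)(.*)$"
  has_authority <- grepl(pattern, url)
  scheme    <- sub(pattern, "\\1", url)
  authority <- sub(pattern, "\\2", url)
  rest      <- sub(pattern, "\\3", url)
  # Inputs that do not match the pattern are returned unchanged.
  ifelse(has_authority, paste0(tolower(scheme), tolower(authority), rest), url)
}

lower_host_sketch(c(
  "HTTP://WWW.Example.COM/Some/Path",
  "http://www.example.com/Some/Path"
))
```

Both inputs map to the same string, http://www.example.com/Some/Path, while the mixed-case path is preserved.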

rurl 1.1.0

New features

canonical_join() gains name_A / name_B arguments to set the output original-URL column names explicitly. They default to NULL, preserving the previous deparse(substitute()) behavior; supply them for stable names when piping or passing anonymous inputs (e.g. canonical_join(df[df$x > 1, ], get_b())), which otherwise produced unstable column names. (RURL-fsygrelr)
canonical_join() gains a join_parse_status argument controlling which parse statuses yield joinable keys. The default "ok" preserves the previous behavior (only ok* statuses join); "ok_or_warning" additionally treats the parseable-but-suspicious warning-* statuses (warning-no-tld, warning-invalid-tld, warning-public-suffix) as joinable, at the cost of more potential false-positive matches. (RURL-edqdrvfu)
Cache introspection and configuration. rurl_cache_info() reports the entry count, enabled state, and any bound for each memoization cache (full_parse, domain, tld). rurl_cache_config() enables or disables individual caches and sets an optional max_full_parse bound on the full-parse cache (default Inf, preserving the previous unbounded behavior); when the bound is reached the cache is reset so peak memory stays bounded. The domain and tld caches remain unbounded by design — they grow with the number of unique hosts, not with URL/option combinations — and can be disabled for workloads with very many unique hosts. (RURL-iuotpaqs)
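
The reset-on-bound policy described for the full-parse cache can be sketched in base R as follows. This is an illustration of the general technique only: make_bounded_cache() and its get/set/size closures are hypothetical names, and the exact moment at which rurl resets its cache is an assumption here (the sketch clears the store just before an insert that would exceed the bound).

```r
# Hypothetical sketch of a memoization cache that is cleared wholesale
# once it holds `max_entries` keys, so peak memory stays bounded.
make_bounded_cache <- function(max_entries = Inf) {
  store <- new.env(parent = emptyenv())
  size <- function() length(ls(store, all.names = TRUE))
  list(
    get = function(key) {
      if (exists(key, envir = store, inherits = FALSE)) store[[key]] else NULL
    },
    set = function(key, value) {
      # Assumption: reset happens before the insert that would pass the bound.
      if (size() >= max_entries) {
        rm(list = ls(store, all.names = TRUE), envir = store)
      }
      assign(key, value, envir = store)
      invisible(value)
    },
    size = size
  )
}

cache <- make_bounded_cache(max_entries = 2)
cache$set("a", 1)
cache$set("b", 2)
cache$set("c", 3)  # bound reached: store is cleared, then "c" is added
cache$size()
```

A full reset is cruder than LRU eviction but needs no bookkeeping per entry, which fits a cache whose only purpose is to avoid repeated parsing of the same input.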

Bug fixes

safe_parse_url() now returns port as an integer (or NA_integer_), and safe_parse_urls() no longer errors on URLs that contain an explicit port (e.g. http://example.com:8080/path). Previously the scalar parser returned the port as a character string and the vectorized parser aborted. (RURL-fxyzanfg)
Bracketed IPv6 hosts (e.g. http://[2001:db8::1]/) are now correctly detected as IP hosts: is_ip_host is TRUE, parse_status is "ok", and no TLD/domain derivation is attempted — matching how IPv4 hosts were already handled. An over-escaped detection pattern previously prevented this. (RURL-jpqjndld)
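
To make the two fixes above concrete, here is a minimal base-R sketch of splitting an authority string into host and integer port while recognizing bracketed IPv6 literals. split_host_port() is a hypothetical helper written for this note, not rurl's parser, and the pattern is deliberately loose (it does not validate the IPv6 address itself).

```r
# Hypothetical helper: split "host[:port]" where host may be a bracketed IPv6 literal.
split_host_port <- function(authority) {
  pattern <- "^(\\[[0-9A-Fa-f:.]+\\]|[^:]*)(:([0-9]+))?$"
  m <- regmatches(authority, regexec(pattern, authority))[[1]]
  if (length(m) == 0L) {
    return(list(host = NA_character_, port = NA_integer_, is_ipv6 = FALSE))
  }
  host <- m[2]
  # Port is returned as an integer, or NA_integer_ when absent.
  port <- if (nzchar(m[4])) as.integer(m[4]) else NA_integer_
  list(host = host, port = port, is_ipv6 = grepl("^\\[.*\\]$", host))
}

split_host_port("example.com:8080")
split_host_port("[2001:db8::1]")
```

Matching the bracketed form first keeps the colons inside an IPv6 literal from being mistaken for the port separator.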

Behavior changes (potentially breaking)

subdomain_levels_to_keep = N (for N > 0) now keeps the N rightmost subdomain labels as documented, instead of silently retaining all subdomains. For example, safe_parse_url("http://deep.sub.domain.example.com", subdomain_levels_to_keep = 1) now returns host domain.example.com (was deep.sub.domain.example.com). N = 0 (strip all) is unchanged. Code that relied on the previous no-op behavior for N > 0 will see different output. (RURL-szumhumv)
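
The "keep the N rightmost subdomain labels" rule can be sketched in base R as below. keep_subdomain_levels() is a hypothetical helper for illustration; it takes the registered domain as an argument, whereas rurl derives it from the Public Suffix List.

```r
# Hypothetical helper: keep the `n` subdomain labels closest to the registered domain.
keep_subdomain_levels <- function(host, domain, n) {
  stopifnot(length(host) == 1L, length(domain) == 1L, n >= 0)
  suffix <- paste0(".", domain)
  if (!endsWith(host, suffix)) return(host)  # host has no subdomain part
  sub_part <- substr(host, 1L, nchar(host) - nchar(suffix))
  labels <- strsplit(sub_part, ".", fixed = TRUE)[[1]]
  kept <- utils::tail(labels, n)             # rightmost n labels (none when n = 0)
  paste(c(kept, domain), collapse = ".")
}

keep_subdomain_levels("deep.sub.domain.example.com", "example.com", 1)
keep_subdomain_levels("deep.sub.domain.example.com", "example.com", 0)
```

The first call yields domain.example.com, the host quoted in the entry above, and the second strips all subdomains to example.com.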

Documentation

Documented clean_url composition: it is a normalized canonical key built from scheme, host, and path only. Port, query, fragment, and userinfo are intentionally excluded, and with path_encoding = "decode" the path is shown decoded (human-readable, not guaranteed URL-safe). This matches the existing behavior and the key used by canonical_join() — no behavior change. Corrected a lower_host description that implied userinfo could be retained in clean_url, and fixed a README example whose input contained a literal space (now percent-encoded) so it parses as documented. (RURL-jnboujtd)

rurl 0.3.0

This release adds URL normalization options and canonical dataset joining, and makes parsing of malformed or inconsistent URLs more robust.

Highlights

New case_handling and trailing_slash_handling parameters in safe_parse_url() and get_clean_url() provide greater control over URL formatting.
Introduced canonical_join() for joining datasets on normalized URL keys.
Improved handling of non-standard or malformed schemes like htp://.
Fixed parsing for schemeless URLs with ports (e.g., example.com:8080/path).
More reliable fallback when curl::curl_parse_url() fails internally.
Corrected regular expressions for IPv6 parsing.

rurl 0.2.0

First version for a potential CRAN submission.
Fully tested across macOS, Windows, and Linux.
Achieved 100% unit test coverage.
Improved README and documentation.

This release adds robust support for internationalized domain names (IDNs), improves punycode handling, and ensures accurate extraction of TLDs and registered domains.

Highlights

Accurate TLD extraction for both ASCII and Unicode domains
Graceful fallback when urltools is unavailable
NFC normalization with stringi
100% test coverage with edge cases and punycode validation
Improved internal helpers and clearer test diagnostics

rurl 0.1.3

Improvements

Removed the dependency on the psl package.
Implemented an internal registered domain extraction using the Public Suffix List.
Added internal update_psl.R script to fetch and process the PSL during development.
Improved test coverage to 100%.
Cleaned up exports and internal helpers.
Updated ignores.
Tested on macOS, Windows, and Linux via rhub and win-builder.
CRAN checks pass with 0 errors/warnings and only standard notes.

Documentation

README updated to reflect the use of the PSL and internal domain logic.
LICENSE and attribution clarified for MIT + Mozilla Public Suffix List.

rurl 0.1.2

Stabilization & Coverage

Achieved 100% test coverage.
Added examples to all exported functions.
Improved documentation (@param, @return, etc.) for CRAN compliance.
Cleaned up NAMESPACE and removed unnecessary functions like hello().
Refined URL parsing logic and improved output consistency.

rurl 0.1.0

All get_*() functions are now vectorized and work on character vectors.
Deprecated scalar-only behavior.
Internal parsing made more robust using curl and psl.
Ready for use in mutate() and other tidy workflows.