Changelog
Source: NEWS.md
rurl 1.4.0
Accessor improvements
- `get_path()` gains `path_normalization`, `index_page_handling`, `trailing_slash_handling`, and `path_encoding` arguments, matching the corresponding options of `safe_parse_url()`.
- `get_scheme()` gains `scheme_relative_handling`.
- `get_parse_status()` gains `source` (mapped to `tld_source`) so warning statuses can be queried under a specific PSL section.
- `get_clean_url()` and `get_host()` gain `source` (mapped to `tld_source`).
- `get_host()` gains `host_encoding`.
- `get_domain()`, `get_tld()`, and `get_subdomain()` gain `host_encoding`, mirroring `get_host()`.
All new arguments default to the same values as safe_parse_url(), so existing calls are unaffected.
Behavior change
- The domain-family accessors (`get_domain()`, `get_tld()`, `get_subdomain()`) now follow `host_encoding` (default `"keep"`) instead of always returning Unicode. Under `"keep"` the emitted domain/TLD/subdomain mirrors the input host’s own spelling: an A-label (`xn--…`) host yields A-label parts, a Unicode host yields Unicode parts. Pass `host_encoding = "unicode"` for the previous always-decoded output, or `"idna"` to force A-labels. This makes the domain accessors consistent with `get_host()`, whose `host_encoding` already defaulted to `"keep"`.
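
The `"keep"` rule above amounts to a small decision: look at how the input host is spelled and emit the same form. The following base-R sketch illustrates only that decision; it performs no Punycode conversion, and `host_is_ace()` and `output_form()` are hypothetical names for illustration, not the package's internals.

```r
# Hypothetical helper: TRUE when any label of `host` is an A-label ("xn--" prefix).
host_is_ace <- function(host) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  any(startsWith(labels, "xn--"))
}

# Hypothetical helper: which spelling should the output use?
# "keep" mirrors the input; "unicode" and "idna" force one form.
output_form <- function(host, host_encoding = c("keep", "unicode", "idna")) {
  host_encoding <- match.arg(host_encoding)
  switch(host_encoding,
    keep    = if (host_is_ace(host)) "idna" else "unicode",
    unicode = "unicode",
    idna    = "idna"
  )
}
```

For a plain ASCII host such as `example.com` the two forms coincide, so the choice only matters for internationalized hosts.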
Internal
- Parse-status string literals replaced by named constants (`R/status-constants.R`) and predicates (`.is_ok_status()`, `.is_warning_status()`, `.is_joinable_status()`).
- Cache touchpoints in `R/zzz.R` now driven from a single `.CACHE_REGISTRY` instead of repeating cache names by hand.
- Cleared the `lintr`/`goodpractice` findings across `R/` and the tests (e.g. `fixed = TRUE` dot splits, condition-message construction, dropped unnecessary lambdas) with no behavior change.
- `.lintr` now mirrors `goodpractice`’s linter set, so a local `lintr::lint_package()` matches the `goodpractice` report; intentional test-idiom deviations are documented in the config header.
- Restored 100% line coverage: added targeted tests for the `.punycode_to_unicode("")`, `.host_is_ace()`, and `.cache_enabled()` guard branches and the `derive_parse_status()` NA-host-dot fallback (and fixed an over-escaped regex literal that left the scheme-slash NA guard untested). The two genuinely unreachable `www`-prefix regex-capture fallbacks are now marked `# nocov` with justification.
- Reduced the cyclomatic complexity of `canonical_join()` (47→7), `get_subdomain()` (26→6), `rurl_cache_config()` (23→5), and `safe_parse_urls()` (19→3) by extracting named sub-helpers (e.g. `.cj_validate_inputs()`/`.cj_resolve_sides()`/`.cj_build_join_df()`, `.subdomain_labels()`, `.validate_max_full_parse()`, `.spu_coerce_original()`). No behavior change; no function in the package now exceeds the `goodpractice` cyclocomp threshold of 15.
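
As an illustration of the constants-plus-predicates pattern from the first item, here is a minimal sketch. The constant and function names are hypothetical and the real contents of `R/status-constants.R` may differ; the status strings themselves are the ones named elsewhere in this changelog.

```r
# Hypothetical constants; the package's actual names may differ.
STATUS_OK                    <- "ok"
STATUS_WARNING_NO_TLD        <- "warning-no-tld"
STATUS_WARNING_INVALID_TLD   <- "warning-invalid-tld"
STATUS_WARNING_PUBLIC_SUFFIX <- "warning-public-suffix"

# Vectorized predicates over character vectors; NA statuses are never ok or warning.
is_ok_status <- function(status) {
  !is.na(status) & startsWith(status, "ok")
}
is_warning_status <- function(status) {
  !is.na(status) & startsWith(status, "warning-")
}

# Joinable under "ok" means ok* only; "ok_or_warning" also admits warning-*.
is_joinable_status <- function(status,
                               join_parse_status = c("ok", "ok_or_warning")) {
  join_parse_status <- match.arg(join_parse_status)
  ok <- is_ok_status(status)
  if (join_parse_status == "ok_or_warning") ok | is_warning_status(status) else ok
}
```

Centralizing the prefix checks this way means a renamed or newly added status only has to be handled in one place.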
rurl 1.3.0
Dependencies
- Public Suffix List matching is now delegated to the `pslr` package (`Imports: pslr (>= 1.0.1)`). `rurl` no longer ships its own processed copy of the list (`R/sysdata.rda`) or its embedded matcher, and `data-raw/update_psl.R` has been removed. `punycoder` is now required at `>= 1.1.0`.
Behavior changes (PSL correctness)
The embedded matcher used through 1.2.0 was not fully spec-correct. Delegating to pslr fixes the following; outputs change accordingly:
- Wildcard rules (`*.`) are now honored by TLD extraction. For example `get_tld("a.b.kobe.jp")` is now `"b.kobe.jp"` (was `"kobe.jp"`).
- Exception rules (`!`) are now honored by TLD extraction. For example `get_tld("www.ck")` is now `"ck"` (was `"www.ck"`), and `get_tld("foo.ck")` is now `"foo.ck"` (was `"ck"`).
- IDN hosts now resolve a registered domain in every section. For example `get_domain("example.рф")` is now `"example.рф"` (was `NA`).
- `safe_parse_url()`/`safe_parse_urls()` now derive the `domain` field using the requested `tld_source` rather than always using the combined list, so `domain` and `tld` are consistent within a parse. Under `tld_source = "private"` (or `"icann"`), a host with no suffix in that section now has `domain = NA`; consequently `subdomain_levels_to_keep` is a no-op for such hosts (there is no registered domain to trim toward). The default `tld_source = "all"` is unaffected.
- Hosts under an unknown TLD continue to return `NA` for both domain and TLD (`rurl` queries `pslr` with `unknown = "na"`), rather than treating an unknown single label as a public suffix.
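
The wildcard, exception, and unknown-TLD behavior listed above can be illustrated with a small base-R sketch. This is not how `pslr` or `rurl` is implemented: the rule set is a tiny hand-picked subset written in PSL syntax, and `rule_matches()` and `public_suffix()` are hypothetical helper names.

```r
# Toy rule set in PSL syntax; the real Public Suffix List is far larger.
rules <- c("com", "jp", "*.kobe.jp", "!city.kobe.jp", "*.ck", "!www.ck")

# TRUE when the rule's labels match the rightmost labels of the host;
# "*" matches exactly one label.
rule_matches <- function(rule_labels, host_labels) {
  n <- length(rule_labels)
  if (n > length(host_labels)) return(FALSE)
  tail_labels <- utils::tail(host_labels, n)
  all(rule_labels == "*" | rule_labels == tail_labels)
}

public_suffix <- function(host, rules) {
  host_labels  <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  is_exception <- startsWith(rules, "!")
  rule_labels  <- strsplit(sub("^!", "", rules), ".", fixed = TRUE)
  hit <- vapply(rule_labels, rule_matches, logical(1), host_labels = host_labels)

  # No rule matches: report NA instead of guessing a suffix.
  if (!any(hit)) return(NA_character_)

  if (any(hit & is_exception)) {
    # An exception rule wins; the suffix is the rule minus its leftmost label.
    n <- max(lengths(rule_labels)[hit & is_exception]) - 1L
  } else {
    # Otherwise the matching rule with the most labels wins.
    n <- max(lengths(rule_labels)[hit])
  }
  paste(utils::tail(host_labels, n), collapse = ".")
}
```

With this toy rule set, `public_suffix("a.b.kobe.jp", rules)` gives `"b.kobe.jp"`, `public_suffix("www.ck", rules)` gives `"ck"`, and `public_suffix("foo.ck", rules)` gives `"foo.ck"`, in line with the examples in the list.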
Cache changes
- The per-host `domain` and `tld` memoization caches have been removed; `pslr` caches its own query results. `rurl_cache_config()` and `rurl_cache_info()` now cover only `full_parse`, `puny_encode`, and `puny_decode`, and the `domain`/`tld` arguments to `rurl_cache_config()` no longer exist.
rurl 1.2.0
CRAN release: 2026-06-19
Dependencies
- `punycoder` (used for IDNA/Punycode encoding and decoding) is now on CRAN. `DESCRIPTION` requires `punycoder (>= 1.0.0)`.
Behavior changes
- The package-wide default for `case_handling` is now `"lower_host"` (was `"keep"` for `safe_parse_url()`, `safe_parse_urls()`, `get_clean_url()`, and the `get_*()` accessors, and `"lower"` for `get_path()`). This is the RFC 3986 §6.2.2.1 normalization: the case-insensitive scheme and host fold to lowercase while the case-sensitive path is preserved. With the previous defaults, hosts such as `WWW.Example.COM` and `www.example.com` did not fold to one identity, and `get_path()` silently lowercased paths (two pages that differ only by path casing collapsed to one). Pass `case_handling = "keep"` to restore the previous reconstruction, or `"lower"` to lowercase the whole URL including the path. (RURL-lzepdnmm)
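
To make the `"lower_host"` normalization concrete, here is a minimal base-R sketch of the RFC 3986 case folding: scheme and authority are lowercased, everything from the path onward is left alone. `fold_scheme_and_host()` is a hypothetical name, and the sketch assumes an absolute URL with no userinfo component (userinfo is case-sensitive and would need separate handling).

```r
# Hypothetical helper: lowercase the scheme and host, keep path/query/fragment.
fold_scheme_and_host <- function(url) {
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*://)([^/?#]*)(.*)$"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  # No "scheme://" prefix: leave the input untouched.
  if (length(m) == 0L) return(url)
  paste0(tolower(m[2]), tolower(m[3]), m[4])
}
```

With this rule, `HTTP://WWW.Example.COM/Docs/Page` and `http://www.example.com/Docs/Page` fold to the same string, while `/Docs/Page` and `/docs/page` remain distinct paths.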
rurl 1.1.0
New features
- `canonical_join()` gains `name_A`/`name_B` arguments to set the output original-URL column names explicitly. They default to `NULL`, preserving the previous `deparse(substitute())` behavior; supply them for stable names when piping or passing anonymous inputs (e.g. `canonical_join(df[df$x > 1, ], get_b())`), which otherwise produced unstable column names. (RURL-fsygrelr)
- `canonical_join()` gains a `join_parse_status` argument controlling which parse statuses yield joinable keys. The default `"ok"` preserves the previous behavior (only `ok*` statuses join); `"ok_or_warning"` additionally treats the parseable-but-suspicious `warning-*` statuses (`warning-no-tld`, `warning-invalid-tld`, `warning-public-suffix`) as joinable, at the cost of more potential false-positive matches. (RURL-edqdrvfu)
- Cache introspection and configuration. `rurl_cache_info()` reports the entry count, enabled state, and any bound for each memoization cache (`full_parse`, `domain`, `tld`). `rurl_cache_config()` enables or disables individual caches and sets an optional `max_full_parse` bound on the full-parse cache (default `Inf`, preserving the previous unbounded behavior); when the bound is reached the cache is reset so peak memory stays bounded. The `domain` and `tld` caches remain unbounded by design (they grow with the number of unique hosts, not with URL/option combinations) and can be disabled for workloads with very many unique hosts. (RURL-iuotpaqs)
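
The reset-on-full policy described for `max_full_parse` can be sketched with an environment-backed memoization cache. This is an illustrative base-R sketch under that stated policy, not the package's code; `make_cache()` and its members are hypothetical names.

```r
# Hypothetical bounded memoization cache: when `max_entries` is reached,
# the whole store is cleared before the next new key is added.
make_cache <- function(max_entries = Inf) {
  store <- new.env(parent = emptyenv())
  size <- function() length(ls(store, all.names = TRUE))

  get_or_compute <- function(key, compute) {
    if (exists(key, envir = store, inherits = FALSE)) {
      return(get(key, envir = store, inherits = FALSE))
    }
    if (size() >= max_entries) {
      # Bound reached: reset rather than evict entry by entry.
      rm(list = ls(store, all.names = TRUE), envir = store)
    }
    value <- compute(key)
    assign(key, value, envir = store)
    value
  }

  list(get_or_compute = get_or_compute, size = size)
}
```

A full reset is cruder than LRU eviction but needs no bookkeeping per entry, and it still keeps peak memory proportional to the bound.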
Bug fixes
- `safe_parse_url()` now returns `port` as an integer (or `NA_integer_`), and `safe_parse_urls()` no longer errors on URLs that contain an explicit port (e.g. `http://example.com:8080/path`). Previously the scalar parser returned the port as a character string and the vectorized parser aborted. (RURL-fxyzanfg)
- Bracketed IPv6 hosts (e.g. `http://[2001:db8::1]/`) are now correctly detected as IP hosts: `is_ip_host` is `TRUE`, `parse_status` is `"ok"`, and no TLD/domain derivation is attempted, matching how IPv4 hosts were already handled. An over-escaped detection pattern previously prevented this. (RURL-jpqjndld)
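
Both fixes can be illustrated in base R. The sketch below is hypothetical throughout: `parse_port()` and `is_bracketed_ipv6()` are illustrative names, and `over_escaped` only shows how one extra level of backslashes can silently disable a bracket check; it is not the pattern the package actually used.

```r
# Hypothetical helper: return the port of an authority string as an integer,
# or NA_integer_ when no explicit port is present.
parse_port <- function(authority) {
  m <- regmatches(authority, regexec(":([0-9]+)$", authority))[[1]]
  if (length(m) == 0L) NA_integer_ else as.integer(m[2])
}

# Hypothetical helper: detect a bracketed IPv6 literal such as "[2001:db8::1]".
is_bracketed_ipv6 <- function(host) {
  grepl("^\\[[0-9A-Fa-f:.]+\\]$", host, perl = TRUE)
}

# Illustration of over-escaping: with four backslashes the regex demands a
# literal backslash before the bracket, so real hosts never match.
over_escaped <- "^\\\\[[0-9A-Fa-f:.]+\\\\]$"
```

Because the port pattern is anchored at the end of the string, the colons inside a bracketed IPv6 literal are not mistaken for a port separator.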
Behavior changes (potentially breaking)
- `subdomain_levels_to_keep = N` (for `N > 0`) now keeps the `N` rightmost subdomain labels as documented, instead of silently retaining all subdomains. For example, `safe_parse_url("http://deep.sub.domain.example.com", subdomain_levels_to_keep = 1)` now returns host `domain.example.com` (was `deep.sub.domain.example.com`). `N = 0` (strip all) is unchanged. Code that relied on the previous no-op behavior for `N > 0` will see different output. (RURL-szumhumv)
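
The trimming rule is easy to state on label vectors. A minimal base-R sketch follows, assuming the registered domain is already known; `keep_subdomain_levels()` is a hypothetical name, not a package function.

```r
# Hypothetical helper: keep the `n` rightmost subdomain labels of `host`,
# where `domain` is the registered domain the host ends with.
keep_subdomain_levels <- function(host, domain, n) {
  host_labels   <- strsplit(host, ".", fixed = TRUE)[[1]]
  domain_labels <- strsplit(domain, ".", fixed = TRUE)[[1]]
  # Everything left of the registered domain is a subdomain label.
  sub_labels <- utils::head(host_labels, length(host_labels) - length(domain_labels))
  kept <- utils::tail(sub_labels, n)
  paste(c(kept, domain_labels), collapse = ".")
}
```

For `deep.sub.domain.example.com` with registered domain `example.com`, `n = 1` keeps `domain.example.com` and `n = 0` keeps `example.com`, matching the documented behavior in the entry.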
Documentation
- Documented `clean_url` composition: it is a normalized canonical key built from scheme, host, and path only. Port, query, fragment, and userinfo are intentionally excluded, and with `path_encoding = "decode"` the path is shown decoded (human-readable, not guaranteed URL-safe). This matches the existing behavior and the key used by `canonical_join()`; no behavior change. Corrected a `lower_host` description that implied userinfo could be retained in `clean_url`, and fixed a README example whose input contained a literal space (now percent-encoded) so it parses as documented. (RURL-jnboujtd)
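
The scheme-host-path composition can be sketched in base R with the generic URI-splitting regular expression from RFC 3986 Appendix B. This illustrates the composition rule only and is not the package's parser: `clean_key()` is a hypothetical name, and path normalization and decoding are not modeled.

```r
# Hypothetical helper: build a key from scheme, host, and path only.
clean_key <- function(url) {
  # RFC 3986 Appendix B: scheme, authority, path, query, fragment.
  pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  scheme    <- m[3]
  authority <- m[5]
  path      <- m[6]
  host <- sub("^[^@]*@", "", authority)  # drop userinfo
  host <- sub(":[0-9]*$", "", host)      # drop port
  # Query (m[8]) and fragment (m[10]) are deliberately ignored.
  paste0(tolower(scheme), "://", tolower(host), path)
}
```

Two URLs that differ only in port, query, fragment, or userinfo therefore produce the same key, which is the property a join on normalized URLs relies on.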
rurl 0.3.0
This release adds powerful capabilities for URL normalization and canonical dataset joining. It significantly improves robustness in handling malformed or inconsistent URLs.
Highlights
- New `case_handling` and `trailing_slash_handling` parameters in `safe_parse_url()` and `get_clean_url()` provide greater control over URL formatting.
- Introduced `canonical_join()` for joining datasets on normalized URL keys.
- Improved handling of non-standard or malformed schemes like `htp://`.
- Fixed parsing for schemeless URLs with ports (e.g., `example.com:8080/path`).
- More reliable fallback when `curl::curl_parse_url()` fails internally.
- Corrected regular expressions for IPv6 parsing.
rurl 0.2.0
- First version for a potential CRAN submission.
- Fully tested across macOS, Windows, and Linux.
- Achieved 100% unit test coverage.
- Improved README and documentation.
This release adds robust support for internationalized domain names (IDNs), improves punycode handling, and ensures accurate extraction of TLDs and registered domains.
rurl 0.1.3
Improvements
- Removed the dependency on the `psl` package.
- Implemented an internal registered domain extraction using the Public Suffix List.
- Added internal `update_psl.R` script to fetch and process the PSL during development.
- Improved test coverage to 100%.
- Cleaned up exports and internal helpers.
- Updated ignores.
- Tested on macOS, Windows, and Linux via rhub and win-builder.
- CRAN checks pass with 0 errors/warnings and only standard notes.