rurl 1.4.0

Accessor improvements

All new arguments default to the same values as safe_parse_url(), so existing calls are unaffected.

Behavior change

  • The domain-family accessors (get_domain(), get_tld(), get_subdomain()) now follow host_encoding (default "keep") instead of always returning Unicode. Under "keep" the emitted domain/TLD/subdomain mirrors the input host’s own spelling: an A-label (xn--…) host yields A-label parts, a Unicode host yields Unicode parts. Pass host_encoding = "unicode" for the previous always-decoded output, or "idna" to force A-labels. This makes the domain accessors consistent with get_host(), whose host_encoding already defaulted to "keep".
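
As a rough illustration of the "keep" rule described above, the base-R sketch below picks the output spelling from the input host. The helper name resolve_host_encoding() and its xn-- test are hypothetical and are not rurl's internals; it only decides which spelling would be emitted and does no encoding or decoding itself.

```r
# Hypothetical helper (not part of rurl): decide which spelling the
# domain-family output should use for a single input host.
resolve_host_encoding <- function(host,
                                  host_encoding = c("keep", "unicode", "idna")) {
  host_encoding <- match.arg(host_encoding)
  if (host_encoding != "keep") {
    return(host_encoding)  # explicit request wins over the input spelling
  }
  # Under "keep", mirror the input: any xn-- label marks an A-label host.
  has_a_label <- grepl("(^|\\.)xn--", host, ignore.case = TRUE)
  if (has_a_label) "idna" else "unicode"
}

resolve_host_encoding("xn--e1afmkfd.xn--p1ai")             # "idna"
resolve_host_encoding("example.рф")                        # "unicode"
resolve_host_encoding("xn--e1afmkfd.xn--p1ai", "unicode")  # "unicode"
```

For a plain ASCII host such as example.com the two spellings coincide, so the choice made by the sketch has no visible effect.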

Internal

  • Parse-status string literals replaced by named constants (R/status-constants.R) and predicates (.is_ok_status(), .is_warning_status(), .is_joinable_status()).
  • Cache touchpoints in R/zzz.R now driven from a single .CACHE_REGISTRY instead of repeating cache names by hand.
  • Cleared the lintr/goodpractice findings across R/ and the tests (e.g. fixed = TRUE dot splits, condition-message construction, dropped unnecessary lambdas) with no behavior change.
  • .lintr now mirrors goodpractice’s linter set, so a local lintr::lint_package() matches the goodpractice report; intentional test-idiom deviations are documented in the config header.
  • Restored 100% line coverage: added targeted tests for the .punycode_to_unicode(""), .host_is_ace(), and .cache_enabled() guard branches and the derive_parse_status() NA-host-dot fallback (and fixed an over-escaped regex literal that left the scheme-slash NA guard untested). The two genuinely unreachable www-prefix regex-capture fallbacks are now marked # nocov with justification.
  • Reduced the cyclomatic complexity of canonical_join() (47→7), get_subdomain() (26→6), rurl_cache_config() (23→5), and safe_parse_urls() (19→3) by extracting named sub-helpers (e.g. .cj_validate_inputs()/.cj_resolve_sides()/.cj_build_join_df(), .subdomain_labels(), .validate_max_full_parse(), .spu_coerce_original()). No behavior change; no function in the package now exceeds the goodpractice cyclocomp threshold of 15.

rurl 1.3.0

Dependencies

  • Public Suffix List matching is now delegated to the pslr package (Imports: pslr (>= 1.0.1)). rurl no longer ships its own processed copy of the list (R/sysdata.rda) or its embedded matcher, and data-raw/update_psl.R has been removed. punycoder is now required at >= 1.1.0.

Behavior changes (PSL correctness)

The embedded matcher used through 1.2.0 was not fully spec-correct. Delegating to pslr fixes the following; outputs change accordingly:

  • Wildcard rules (*.) are now honored by TLD extraction. For example get_tld("a.b.kobe.jp") is now "b.kobe.jp" (was "kobe.jp").
  • Exception rules (!) are now honored by TLD extraction. For example get_tld("www.ck") is now "ck" (was "www.ck"), and get_tld("foo.ck") is now "foo.ck" (was "ck").
  • IDN hosts now resolve a registered domain in every section. For example get_domain("example.рф") is now "example.рф" (was NA).
  • safe_parse_url() / safe_parse_urls() now derive the domain field using the requested tld_source rather than always using the combined list, so domain and tld are consistent within a parse. Under tld_source = "private" (or "icann"), a host with no suffix in that section now has domain = NA; consequently subdomain_levels_to_keep is a no-op for such hosts (there is no registered domain to trim toward). The default tld_source = "all" is unaffected.
  • Hosts under an unknown TLD continue to return NA for both domain and TLD (rurl queries pslr with unknown = "na"), rather than treating an unknown single label as a public suffix.
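
The wildcard and exception semantics listed above can be sketched with a toy matcher in base R. This is not pslr's implementation: the helper toy_public_suffix() is hypothetical, and the five rules are a hand-picked subset modeled on the list's .jp and .ck entries. The sketch assumes the usual Public Suffix List precedence (an exception rule prevails, otherwise the longest matching rule wins) and returns NA when nothing matches, mirroring the unknown = "na" behavior described above.

```r
# Toy rule set, illustrative only (modeled on the list's .jp / .ck entries).
rules <- c("jp", "*.kobe.jp", "!city.kobe.jp", "*.ck", "!www.ck")

# Hypothetical helper: public suffix of `host`, or NA if no rule matches.
toy_public_suffix <- function(host, rules) {
  # Work with labels in reverse order so index 1 is the rightmost label.
  labels <- rev(strsplit(tolower(host), ".", fixed = TRUE)[[1]])
  best <- NULL
  best_len <- 0L
  for (rule in rules) {
    is_exception <- startsWith(rule, "!")
    rule_labels <- rev(strsplit(sub("^!", "", rule), ".", fixed = TRUE)[[1]])
    n <- length(rule_labels)
    if (n > length(labels)) next
    # A "*" rule label matches any single host label.
    matches <- all(rule_labels == "*" | rule_labels == labels[seq_len(n)])
    if (!matches) next
    if (is_exception) {
      # An exception rule prevails; the suffix is the rule minus its leftmost label.
      return(paste(rev(labels[seq_len(n - 1L)]), collapse = "."))
    }
    if (n > best_len) {
      best_len <- n
      best <- paste(rev(labels[seq_len(n)]), collapse = ".")
    }
  }
  if (is.null(best)) NA_character_ else best
}

toy_public_suffix("a.b.kobe.jp", rules)   # "b.kobe.jp" (wildcard rule)
toy_public_suffix("www.ck", rules)        # "ck"        (exception rule)
toy_public_suffix("foo.ck", rules)        # "foo.ck"    (wildcard rule)
toy_public_suffix("host.invalid", rules)  # NA          (no rule matches)
```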

Cache changes

rurl 1.2.0

CRAN release: 2026-06-19

Dependencies

  • punycoder (used for IDNA/Punycode encoding and decoding) is now on CRAN. DESCRIPTION requires punycoder (>= 1.0.0).

Behavior changes

  • The package-wide default for case_handling is now "lower_host" (was "keep" for safe_parse_url(), safe_parse_urls(), get_clean_url(), and the get_*() accessors, and "lower" for get_path()). This is the RFC 3986 §6.2.2.1 normalization: the case-insensitive scheme and host fold to lowercase while the case-sensitive path is preserved. With the previous defaults, hosts such as WWW.Example.COM and www.example.com did not fold to one identity, and get_path() silently lowercased paths (two pages that differ only by path casing collapsed to one). Pass case_handling = "keep" to restore the previous reconstruction, or "lower" to lowercase the whole URL including the path. (RURL-lzepdnmm)
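
A minimal base-R sketch of the "lower_host" idea follows. The helper lower_host_sketch() is hypothetical and is not how rurl normalizes URLs; it assumes an absolute URL without userinfo, since it lowercases everything between scheme:// and the first /, ? or #.

```r
# Hypothetical helper (not rurl's implementation): fold scheme and host to
# lowercase and leave path, query and fragment untouched.
lower_host_sketch <- function(url) {
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*://)([^/?#]*)(.*)$"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0L) {
    return(url)  # no scheme://authority prefix; leave the input as-is
  }
  # m[2] = "scheme://", m[3] = authority (assumed to be host[:port]), m[4] = rest
  paste0(tolower(m[2]), tolower(m[3]), m[4])
}

lower_host_sketch("HTTP://WWW.Example.COM/Some/Path?Q=1")
# "http://www.example.com/Some/Path?Q=1"
```

The path keeps its casing, so /Some/Path and /some/path stay distinct, while WWW.Example.COM and www.example.com fold to the same host.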

rurl 1.1.0

New features

  • canonical_join() gains name_A / name_B arguments to set the output original-URL column names explicitly. They default to NULL, preserving the previous deparse(substitute()) behavior; supply them for stable names when piping or passing anonymous inputs (e.g. canonical_join(df[df$x > 1, ], get_b())), which otherwise produced unstable column names. (RURL-fsygrelr)

  • canonical_join() gains a join_parse_status argument controlling which parse statuses yield joinable keys. The default "ok" preserves the previous behavior (only ok* statuses join); "ok_or_warning" additionally treats the parseable-but-suspicious warning-* statuses (warning-no-tld, warning-invalid-tld, warning-public-suffix) as joinable, at the cost of more potential false-positive matches. (RURL-edqdrvfu)

  • Cache introspection and configuration. rurl_cache_info() reports the entry count, enabled state, and any bound for each memoization cache (full_parse, domain, tld). rurl_cache_config() enables or disables individual caches and sets an optional max_full_parse bound on the full-parse cache (default Inf, preserving the previous unbounded behavior); when the bound is reached the cache is reset so peak memory stays bounded. The domain and tld caches remain unbounded by design — they grow with the number of unique hosts, not with URL/option combinations — and can be disabled for workloads with very many unique hosts. (RURL-iuotpaqs)
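
The reset-on-bound policy described for max_full_parse can be pictured with the base-R sketch below. It is an assumption-laden illustration, not rurl's cache code: make_reset_cache(), fetch() and size() are hypothetical names, and the exact point at which rurl triggers the reset may differ.

```r
# Hypothetical bounded memo cache (not rurl's internals): once the entry
# count reaches `max_entries`, the whole cache is cleared before inserting.
make_reset_cache <- function(max_entries = Inf) {
  store <- new.env(parent = emptyenv())
  n_entries <- function() length(ls(store, all.names = TRUE))
  list(
    fetch = function(key, compute) {
      if (exists(key, envir = store, inherits = FALSE)) {
        return(get(key, envir = store, inherits = FALSE))  # cache hit
      }
      if (n_entries() >= max_entries) {
        # Reset rather than evict single entries: simple, and memory stays bounded.
        rm(list = ls(store, all.names = TRUE), envir = store)
      }
      value <- compute(key)
      assign(key, value, envir = store)
      value
    },
    size = n_entries
  )
}

cache <- make_reset_cache(max_entries = 2)
cache$fetch("a", toupper)
cache$fetch("b", toupper)
cache$size()               # 2
cache$fetch("c", toupper)  # bound reached: cache is cleared, then "c" is stored
cache$size()               # 1
```

With the default max_entries = Inf the reset branch never fires, which corresponds to the unbounded behavior.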

Bug fixes

  • safe_parse_url() now returns port as an integer (or NA_integer_), and safe_parse_urls() no longer errors on URLs that contain an explicit port (e.g. http://example.com:8080/path). Previously the scalar parser returned the port as a character string and the vectorized parser aborted. (RURL-fxyzanfg)
  • Bracketed IPv6 hosts (e.g. http://[2001:db8::1]/) are now correctly detected as IP hosts: is_ip_host is TRUE, parse_status is "ok", and no TLD/domain derivation is attempted — matching how IPv4 hosts were already handled. An over-escaped detection pattern previously prevented this. (RURL-jpqjndld)

Behavior changes (potentially breaking)

  • subdomain_levels_to_keep = N (for N > 0) now keeps the N rightmost subdomain labels as documented, instead of silently retaining all subdomains. For example, safe_parse_url("http://deep.sub.domain.example.com", subdomain_levels_to_keep = 1) now returns host domain.example.com (was deep.sub.domain.example.com). N = 0 (strip all) is unchanged. Code that relied on the previous no-op behavior for N > 0 will see different output. (RURL-szumhumv)
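
The corrected trimming rule can be sketched in base R as follows. The helper trim_subdomains() is hypothetical and takes the registered domain as an input, whereas rurl derives it from the Public Suffix List; the sketch only shows which labels survive for a given N.

```r
# Hypothetical helper (not rurl's implementation): keep the `n` rightmost
# subdomain labels in front of an already-known registered domain.
trim_subdomains <- function(host, registered_domain, n) {
  suffix <- paste0(".", registered_domain)
  if (!endsWith(host, suffix)) {
    return(host)  # no subdomain part to trim
  }
  sub_part <- substr(host, 1L, nchar(host) - nchar(suffix))
  labels <- strsplit(sub_part, ".", fixed = TRUE)[[1]]
  kept <- utils::tail(labels, n)  # n = 0 keeps nothing
  paste(c(kept, registered_domain), collapse = ".")
}

trim_subdomains("deep.sub.domain.example.com", "example.com", 1)
# "domain.example.com"
trim_subdomains("deep.sub.domain.example.com", "example.com", 0)
# "example.com"
```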

Documentation

  • Documented clean_url composition: it is a normalized canonical key built from scheme, host, and path only. Port, query, fragment, and userinfo are intentionally excluded, and with path_encoding = "decode" the path is shown decoded (human-readable, not guaranteed URL-safe). This matches the existing behavior and the key used by canonical_join() — no behavior change. Corrected a lower_host description that implied userinfo could be retained in clean_url, and fixed a README example whose input contained a literal space (now percent-encoded) so it parses as documented. (RURL-jnboujtd)

rurl 0.3.0

This release adds finer control over URL normalization and a canonical join for datasets keyed by URL. It also handles malformed or inconsistent URLs more robustly.

Highlights

  • New case_handling and trailing_slash_handling parameters in safe_parse_url() and get_clean_url() provide greater control over URL formatting.
  • Introduced canonical_join() for joining datasets on normalized URL keys.
  • Improved handling of non-standard or malformed schemes like htp://.
  • Fixed parsing for schemeless URLs with ports (e.g., example.com:8080/path).
  • More reliable fallback when curl::curl_parse_url() fails internally.
  • Corrected regular expressions for IPv6 parsing.

rurl 0.2.0

  • First version for a potential CRAN submission.
  • Fully tested across macOS, Windows, and Linux.
  • Achieved 100% unit test coverage.
  • Improved README and documentation.

This release adds robust support for internationalized domain names (IDNs), improves punycode handling, and ensures accurate extraction of TLDs and registered domains.

Highlights

  • Accurate TLD extraction for both ASCII and Unicode domains
  • Graceful fallback when urltools is unavailable
  • NFC normalization with stringi
  • 100% test coverage with edge cases and punycode validation
  • Improved internal helpers and clearer test diagnostics

rurl 0.1.3

Improvements

  • Removed the dependency on the psl package.
  • Implemented an internal registered domain extraction using the Public Suffix List.
  • Added internal update_psl.R script to fetch and process the PSL during development.
  • Improved test coverage to 100%.
  • Cleaned up exports and internal helpers.
  • Updated ignores.
  • Tested on macOS, Windows, and Linux via rhub and win-builder.
  • CRAN checks pass with 0 errors/warnings and only standard notes.

Documentation

  • README updated to reflect the use of the PSL and internal domain logic.
  • LICENSE and attribution clarified for MIT + Mozilla Public Suffix List.

rurl 0.1.2

Stabilization & Coverage

  • Achieved 100% test coverage.
  • Added examples to all exported functions.
  • Improved documentation (@param, @return, etc.) for CRAN compliance.
  • Cleaned up NAMESPACE and removed unnecessary functions like hello().
  • Refined URL parsing logic and improved output consistency.

rurl 0.1.0

  • All get_*() functions are now vectorized and work on character vectors.
  • Deprecated scalar-only behavior.
  • Internal parsing made more robust using curl and psl.
  • Ready for use in mutate() and other tidy workflows.