Introduction
The rurl package provides tools to parse, normalize, and
extract information from URLs using a consistent and safe API.
It is fully vectorized and delegates domain handling to the
pslr package, which implements the Public Suffix List for accurate
domain and TLD extraction.
Safe URL Parsing
Use safe_parse_url() to parse URLs robustly:
safe_parse_url("https://sub.example.co.uk/path?q=1")
#> $original_url
#> [1] "https://sub.example.co.uk/path?q=1"
#>
#> $scheme
#> [1] "https"
#>
#> $host
#> [1] "sub.example.co.uk"
#>
#> $port
#> [1] NA
#>
#> $path
#> [1] "/path"
#>
#> $query
#> [1] "q=1"
#>
#> $fragment
#> [1] NA
#>
#> $user
#> [1] NA
#>
#> $password
#> [1] NA
#>
#> $domain
#> [1] "example.co.uk"
#>
#> $tld
#> [1] "co.uk"
#>
#> $is_ip_host
#> [1] FALSE
#>
#> $clean_url
#> [1] "https://sub.example.co.uk/path"
#>
#> $parse_status
#> [1] "ok"
The protocol_handling argument controls how schemes are handled:
- "keep" (default; keeps the current protocol or prepends http:// if missing)
- "none" (doesn’t add, remove, or change protocols)
- "strip" (removes protocols)
- "http" (changes protocols to http:// or adds it if missing)
- "https" (changes protocols to https:// or adds it if missing)
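The five modes can be illustrated with a minimal base-R sketch. The helper apply_protocol() is hypothetical and is not part of rurl; it only mirrors the behaviour described in the list above and makes no claim about how the package implements it.

```r
# Illustrative sketch of the five protocol_handling modes described above.
# apply_protocol() is a hypothetical helper, NOT a rurl function.
apply_protocol <- function(url, mode = c("keep", "none", "strip", "http", "https")) {
  mode <- match.arg(mode)
  scheme_re <- "^[A-Za-z][A-Za-z0-9+.-]*://"
  has_scheme <- grepl(scheme_re, url)
  bare <- sub(scheme_re, "", url) # URL with any leading scheme removed
  out <- switch(mode,
    keep  = ifelse(has_scheme, url, paste0("http://", url)),
    none  = url,
    strip = bare,
    http  = paste0("http://", bare),
    https = paste0("https://", bare)
  )
  out[is.na(url)] <- NA_character_ # propagate missing inputs
  out
}

apply_protocol(c("example.com", "https://example.com", NA), "keep")
apply_protocol("https://example.com", "strip")
```

Under these assumptions, "keep" only adds http:// when no scheme is present, while "http" and "https" overwrite whatever scheme was there.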
Extracting URL Components
get_scheme("https://sub.example.com")
#> https://sub.example.com
#> "https"
get_host("https://sub.example.com")
#> https://sub.example.com
#> "sub.example.com"
get_path("https://sub.example.com/path/to/page")
#> https://sub.example.com/path/to/page
#> "/path/to/page"
Each function works on vectors of URLs and handles NA gracefully.
Domain and TLD Parsing
These functions rely on the Public Suffix List:
get_domain("https://a.b.example.co.uk")
#> https://a.b.example.co.uk
#> "example.co.uk"
Extracting TLDs from different sources:
get_tld("https://foo.blogspot.com")
#> https://foo.blogspot.com
#> "blogspot.com"
Sources include:
- "all" (default; matches the longest available TLD)
- "private" (extracts only private TLDs)
- "icann" (extracts only ICANN TLDs)
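To make the longest-match idea concrete, here is a small base-R sketch over a toy suffix table. The table, the helper match_suffix(), and its parameter name are all assumptions made for illustration; the real Public Suffix List is far larger, and the actual lookup is done by pslr.

```r
# Toy suffix table for illustration only; NOT the real Public Suffix List.
toy_suffixes <- data.frame(
  suffix = c("com", "uk", "co.uk", "blogspot.com"),
  type   = c("icann", "icann", "icann", "private"),
  stringsAsFactors = FALSE
)

# match_suffix() is a hypothetical helper, NOT part of rurl or pslr.
match_suffix <- function(host, source = c("all", "private", "icann")) {
  source <- match.arg(source)
  candidates <- toy_suffixes$suffix
  if (source != "all") {
    candidates <- candidates[toy_suffixes$type == source]
  }
  # A suffix matches when the host equals it or ends with ".<suffix>".
  is_hit <- vapply(
    candidates,
    function(s) identical(host, s) || endsWith(host, paste0(".", s)),
    logical(1)
  )
  hits <- candidates[is_hit]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))] # longest matching suffix wins
}

match_suffix("foo.blogspot.com", "all")
match_suffix("foo.blogspot.com", "icann")
```

With this toy table, "all" and "private" both select "blogspot.com" for the host above, whereas "icann" falls back to "com".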
Vectorization and Edge Cases
All core functions support vectors and handle malformed inputs safely:
urls <- c("example.com", "http://example.com", NA)
get_clean_url(urls)
#> example.com http://example.com <NA>
#> "http://example.com/" "http://example.com/" NA
Advanced Host Manipulation with subdomain_levels_to_keep
Several functions, including safe_parse_url(),
get_host(), and get_clean_url(), support the
subdomain_levels_to_keep argument. This allows for
fine-grained control over how many subdomain levels are preserved in the
host component of a URL, after initial
www_handling has been applied.
- NULL (default): No specific subdomain stripping is performed beyond www_handling.
- 0: All subdomains are stripped. If www_handling preserved or added ‘www.’, it remains (e.g., ‘www.sub.example.com’ becomes ‘www.example.com’; ‘sub.example.com’ becomes ‘example.com’).
- N > 0: Keeps up to N levels of subdomains, counted from right to left (closest to the registered domain), in addition to any ‘www.’ prefix.
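The trimming rule can be sketched in base R as follows. keep_levels() is a hypothetical helper, not a rurl function: it assumes the registered domain is already known and that the host ends with it, whereas the package derives the domain from the Public Suffix List.

```r
# keep_levels() is a hypothetical helper, NOT part of rurl. It takes the
# registered domain as an argument instead of looking it up, and assumes
# the host ends with that domain.
keep_levels <- function(host, domain, n = NULL) {
  if (is.null(n) || identical(host, domain)) return(host)
  # Subdomain part: everything before ".<domain>"
  sub_part <- substr(host, 1, nchar(host) - nchar(domain) - 1)
  labels <- strsplit(sub_part, ".", fixed = TRUE)[[1]]
  has_www <- length(labels) > 0 && labels[1] == "www"
  if (has_www) labels <- labels[-1] # 'www' is handled separately
  kept <- utils::tail(labels, n)    # the n labels closest to the domain
  paste(c(if (has_www) "www", kept, domain), collapse = ".")
}

keep_levels("www.three.two.one.example.com", "example.com", 0)
keep_levels("www.three.two.one.example.com", "example.com", 2)
```

The sketch treats a leading ‘www’ label as separate from the counted subdomain levels, which is how the list above describes it.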
Here are some examples demonstrating its effect on
get_host():
get_host(
"http://www.three.two.one.example.com",
subdomain_levels_to_keep = 0
) # www_handling default is "none"
#> http://www.three.two.one.example.com
#> "www.example.com"
# Expected: "www.example.com"
get_host(
"http://three.two.one.example.com",
www_handling = "strip",
subdomain_levels_to_keep = 0
)
#> http://three.two.one.example.com
#> "example.com"
# Expected: "example.com"
get_host("http://www.three.two.one.example.com", subdomain_levels_to_keep = 1)
#> http://www.three.two.one.example.com
#> "www.one.example.com"
# Expected: "www.one.example.com"
get_host(
"http://three.two.one.example.com",
www_handling = "strip",
subdomain_levels_to_keep = 1
)
#> http://three.two.one.example.com
#> "one.example.com"
# Expected: "one.example.com"
get_host(
"http://www.three.two.one.example.com",
www_handling = "keep",
subdomain_levels_to_keep = 2
)
#> http://www.three.two.one.example.com
#> "www.two.one.example.com"
# Expected: "www.two.one.example.com"
And its effect on get_clean_url():
get_clean_url(
"http://www.deep.sub.example.com/some/path",
subdomain_levels_to_keep = 0,
www_handling = "keep"
)
#> http://www.deep.sub.example.com/some/path
#> "http://www.example.com/some/path"
# yields http://www.example.com/some/path
get_clean_url(
"http://deep.sub.example.com/some/path",
subdomain_levels_to_keep = 1
)
#> http://deep.sub.example.com/some/path
#> "http://sub.example.com/some/path"
# yields http://sub.example.com/some/path
Note that get_domain() also accepts subdomain_levels_to_keep, but the
argument does not change the returned domain value, because the domain
is derived from the host before this host modification occurs. The
parameter only influences the host component used in other parts of the
safe_parse_url() output, such as clean_url.
Summary
- Vectorized functions for parsing and cleaning URLs
- Uses the Public Suffix List for domain logic
- Unicode/punycode support