Returns the cleaned version of each input URL after applying the
protocol, www, case, and trailing-slash handling rules. The result is a
normalized canonical key composed of scheme, host, and path only; port,
query, fragment, and userinfo are intentionally excluded (use
get_port, get_query, get_fragment, or get_userinfo for those).
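As a rough illustration of what "scheme, host, and path only" means, the base R sketch below builds such a key from an absolute URL. sketch_canonical_key is a hypothetical helper, not part of this package; it relies on the generic URI regular expression from RFC 3986, Appendix B, rather than on the package's own parsing, and it applies none of the handling options documented here.

```r
# Hypothetical helper (not the package's implementation): reduce an absolute
# URL to "scheme://host/path", dropping userinfo, port, query, and fragment.
sketch_canonical_key <- function(url) {
  stopifnot(is.character(url), length(url) == 1)
  # Generic URI regex from RFC 3986, Appendix B.
  pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?$"
  parts <- regmatches(url, regexec(pattern, url))[[1]]
  scheme    <- tolower(parts[3])  # capture group 2: scheme
  authority <- parts[5]           # capture group 4: [userinfo@]host[:port]
  path      <- parts[6]           # capture group 5: path
  host <- sub("^.*@", "", authority)  # drop userinfo
  host <- sub(":[0-9]*$", "", host)   # drop port
  host <- tolower(host)               # host is case-insensitive
  if (path == "") path <- "/"         # assumption: an empty path becomes "/"
  paste0(scheme, "://", host, path)
}

sketch_canonical_key("HTTPS://user:pw@Example.COM:8443/Path/To?q=1#frag")
# "https://example.com/Path/To"
```

The scheme and host are folded to lowercase while the path keeps its casing, which mirrors the "lower_host" default described under case_handling.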
Usage
get_clean_url(
url,
protocol_handling = "keep",
www_handling = "none",
source = c("all", "private", "icann"),
case_handling = "lower_host",
trailing_slash_handling = "none",
index_page_handling = "keep",
path_normalization = "none",
scheme_relative_handling = "keep",
subdomain_levels_to_keep = NULL,
host_encoding = "keep",
path_encoding = "keep"
)
Arguments
- url
A character vector containing URLs to be parsed.
- protocol_handling
A character string specifying how to handle protocols. Defaults to "keep".
"keep": If a scheme exists (http, https, ftp, ftps), it's used. If no scheme, "http://" is added.
"none": If a scheme exists, it's used. If no scheme, then no scheme is used (scheme component will be NA).
"strip": Any existing scheme is removed (scheme component will be NA).
"http": The scheme is forced to be "http".
"https": The scheme is forced to be "https".
- www_handling
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
"none": (Default) Leaves the host's www prefix (or lack thereof) untouched.
"strip": Removes any "www." or "www[number]." prefix.
"keep": Ensures the host starts with "www.". If it has "www[number].", it's normalized to "www.". If no www prefix, "www." is added. An empty input host remains empty.
"if_no_subdomain": If the host is a bare registered domain (e.g., "example.com"), "www." is added. If the host already has a "www." or "www[number]." prefix, it is normalized to "www." (e.g., "www1.example.com" becomes "www.example.com"; "www1.sub.example.com" becomes "www.sub.example.com"). If a non-www subdomain exists (e.g., "sub.example.com" or the normalized "www.sub.example.com"), the host is not further altered. An empty input host remains empty.
- source
Which Public Suffix List (PSL) source to use: "all", "private", or "icann". Subdomain trimming depends on which section is consulted, so pass source = "icann" to exclude private suffixes (e.g. github.io).
- case_handling
A character string specifying how to handle the case of the cleaned URL. Defaults to "lower_host", the RFC 3986 §6.2.2.1 normalization (scheme and host are case-insensitive and folded to lowercase; the path is case-sensitive and preserved).
"lower_host": (Default) Lowercases scheme and host only; the path keeps its original casing.
"keep": Preserves casing of the reconstructed URL.
"lower": Converts the cleaned URL to lowercase.
"upper": Converts the cleaned URL to uppercase.
- trailing_slash_handling
A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none".
"none": (Default) No specific handling is applied. Path remains as is after initial parsing.
"keep": Ensures a trailing slash. If a path exists and doesn't end with one, it's added. If path is just "/", it's kept.
"strip": Removes a trailing slash if present, unless the path is solely "/".
- index_page_handling
A character string specifying how to handle index/default pages. Defaults to "keep".
"keep": (Default) Leave index/default page segments untouched.
"strip": Remove a trailing index.* or default.* segment (case-insensitive).
- path_normalization
How to normalize path structure. Defaults to "none".
"none": (Default) No normalization.
"collapse_slashes": Collapse duplicate slashes in the path.
"dot_segments": Resolve . and .. segments per RFC 3986.
"both": Apply both collapse_slashes and dot_segments.
- scheme_relative_handling
How to handle URLs starting with "//". Defaults to "keep".
"keep": Parse using http but return scheme as NA and set status to "ok-scheme-relative".
"http": Assume http for parsing and output.
"https": Assume https for parsing and output.
"error": Treat scheme-relative URLs as invalid.
- subdomain_levels_to_keep
An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by `www_handling`.
`NULL`: (Default) No specific subdomain stripping is performed beyond `www_handling`.
`0`: All subdomains are stripped. If `www_handling` preserved or added 'www.', it remains (e.g., 'www.sub.example.com' becomes 'www.example.com'; 'sub.example.com' becomes 'example.com').
`N > 0`: Keeps up to N levels of subdomains, counted from right-to-left (closest to the registered domain), in addition to any 'www.' prefix. E.g., if N=1, 'three.two.one.example.com' becomes 'one.example.com'; 'www.three.two.one.example.com' (post www_handling) becomes 'www.one.example.com'.
- host_encoding
How to present the host in `clean_url`. Defaults to "keep".
"keep": Leave host as parsed by curl (may preserve original case).
"idna": Convert Unicode host labels to Punycode (IDNA) for the cleaned URL.
"unicode": Decode Punycode labels to Unicode for the cleaned URL.
- path_encoding
How to handle percent-encoding in the path for `clean_url`. Defaults to "keep".
"keep": Leave the path percent-encoding untouched.
"encode": Normalize by decoding first, then percent-encoding each segment (slashes preserved).
"decode": Percent-decode UTF-8 sequences in the path.
Examples
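None of the calls in this section exercise path_normalization, so here is a base R sketch of what "collapse_slashes" and "dot_segments" amount to for an absolute path. Both helpers are hypothetical and written for illustration only; the dot-segment logic follows the spirit of RFC 3986, section 5.2.4, and the order used for "both" (collapse first, then resolve) is an assumption, not something this page states.

```r
# Hypothetical helper: collapse runs of "/" into a single "/".
sketch_collapse_slashes <- function(path) {
  gsub("/{2,}", "/", path)
}

# Hypothetical helper: resolve "." and ".." segments of an absolute path.
sketch_remove_dot_segments <- function(path) {
  stopifnot(is.character(path), length(path) == 1, startsWith(path, "/"))
  # strsplit() drops one trailing empty string, so append "/" first to keep
  # every segment, including an empty final one for paths ending in "/".
  segments <- strsplit(paste0(path, "/"), "/", fixed = TRUE)[[1]][-1]
  out <- character(0)
  n <- length(segments)
  for (i in seq_along(segments)) {
    seg <- segments[i]
    if (seg == "." || seg == "..") {
      # ".." removes the previous segment; it cannot climb above the root.
      if (seg == ".." && length(out) > 0) out <- out[-length(out)]
      # A dot segment at the very end leaves a directory-style path.
      if (i == n) out <- c(out, "")
    } else {
      out <- c(out, seg)
    }
  }
  paste0("/", paste(out, collapse = "/"))
}

sketch_collapse_slashes("/a//b///c")
# "/a/b/c"
sketch_remove_dot_segments("/a/b/../c/./d")
# "/a/c/d"
sketch_remove_dot_segments(sketch_collapse_slashes("//a//b/../c/"))
# "/a/c/"
```
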
get_clean_url("Example.COM/Path") # Default lower_host: host folds, path kept
#> Example.COM/Path
#> "http://example.com/Path"
get_clean_url(
"Example.COM/Path",
case_handling = "keep",
trailing_slash_handling = "keep"
)
#> Example.COM/Path
#> "http://Example.COM/Path/"
get_clean_url(
"Example.COM/Path/",
case_handling = "upper",
trailing_slash_handling = "strip"
)
#> Example.COM/Path/
#> "HTTP://EXAMPLE.COM/PATH"
get_clean_url("http://example.com", www_handling = "strip")
#> http://example.com
#> "http://example.com/"
get_clean_url(
"http://deep.sub.domain.example.com/path",
subdomain_levels_to_keep = 0
)
#> http://deep.sub.domain.example.com/path
#> "http://example.com/path"
# -> "http://example.com/path"
get_clean_url(
"http://www.deep.sub.domain.example.com/path",
subdomain_levels_to_keep = 1,
www_handling = "strip"
)
#> http://www.deep.sub.domain.example.com/path
#> "http://domain.example.com/path"
# -> "http://domain.example.com/path"
get_clean_url(
"http://www.deep.sub.domain.example.com/path",
subdomain_levels_to_keep = 1,
www_handling = "keep"
)
#> http://www.deep.sub.domain.example.com/path
#> "http://www.domain.example.com/path"
# -> "http://www.domain.example.com/path"
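The host part of the subdomain examples above can be approximated with a short base R sketch. sketch_trim_subdomains is a hypothetical helper, not the package's code, and it makes one loud assumption: the registered domain is taken to be the last two labels, with no Public Suffix List lookup, so hosts such as "example.co.uk" would be trimmed incorrectly. A leading "www" or "www[number]" label is set aside and restored as "www", as if www_handling = "keep" had already run.

```r
# Hypothetical helper. ASSUMPTION: the registered domain is the last two
# labels (no PSL lookup); a leading www / www[number] label is kept as "www".
sketch_trim_subdomains <- function(host, levels_to_keep) {
  stopifnot(is.character(host), length(host) == 1, levels_to_keep >= 0)
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  has_www <- grepl("^www[0-9]*$", labels[1])
  if (has_www) labels <- labels[-1]
  n <- length(labels)
  registered <- labels[max(1, n - 1):n]
  subdomains <- labels[seq_len(max(0, n - 2))]
  # Keep the N subdomain labels closest to the registered domain.
  kept <- utils::tail(subdomains, levels_to_keep)
  paste(c(if (has_www) "www", kept, registered), collapse = ".")
}

sketch_trim_subdomains("deep.sub.domain.example.com", 0)
# "example.com"
sketch_trim_subdomains("www.deep.sub.domain.example.com", 1)
# "www.domain.example.com"
sketch_trim_subdomains("three.two.one.example.com", 1)
# "one.example.com"
```
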