Returns a cleaned version of each input URL after applying the protocol, www, case, trailing-slash, and other handling rules described under Arguments. The result is a normalized canonical key composed of scheme, host, and path only; port, query, fragment, and userinfo are intentionally excluded (use get_port, get_query, get_fragment, or get_userinfo for those).

Usage

get_clean_url(
  url,
  protocol_handling = "keep",
  www_handling = "none",
  source = c("all", "private", "icann"),
  case_handling = "lower_host",
  trailing_slash_handling = "none",
  index_page_handling = "keep",
  path_normalization = "none",
  scheme_relative_handling = "keep",
  subdomain_levels_to_keep = NULL,
  host_encoding = "keep",
  path_encoding = "keep"
)

Arguments

url

A character vector containing URLs to be parsed.

protocol_handling

A character string specifying how to handle protocols. Defaults to "keep".

  • "keep": An existing scheme (http, https, ftp, or ftps) is preserved. If the URL has no scheme, "http://" is added.

  • "none": An existing scheme is preserved. If the URL has no scheme, none is added (the scheme component will be NA).

  • "strip": Any existing scheme is removed (scheme component will be NA).

  • "http": The scheme is forced to be "http".

  • "https": The scheme is forced to be "https".
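
The five options above can be pictured with a small base R sketch. This is an illustration of the documented rules only, not the package's implementation; `apply_protocol()` is a hypothetical helper, and it works on raw strings rather than on parsed URL components (so "none" and "strip" return a string without a scheme instead of an NA scheme component).

```r
# Hypothetical helper, for illustration only: mimics the documented
# protocol_handling rules on plain strings.
apply_protocol <- function(url, protocol_handling = "keep") {
  # Only the schemes named in the documentation are recognised here.
  scheme_re <- "^(https?|ftps?)://"
  has_scheme <- grepl(scheme_re, url, ignore.case = TRUE)
  rest <- sub(scheme_re, "", url, ignore.case = TRUE)
  switch(
    protocol_handling,
    keep  = ifelse(has_scheme, url, paste0("http://", rest)),
    none  = url,
    strip = rest,
    http  = paste0("http://", rest),
    https = paste0("https://", rest),
    stop("unknown protocol_handling: ", protocol_handling)
  )
}

apply_protocol("example.com/a")                 # scheme added
apply_protocol("ftp://example.com", "strip")    # scheme removed
apply_protocol("http://example.com", "https")   # scheme forced
```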

www_handling

A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".

  • "none": (Default) Leaves the host's www prefix (or lack thereof) untouched.

  • "strip": Removes any "www." or "www[number]." prefix.

  • "keep": Ensures the host starts with "www.". If it has "www[number].", it's normalized to "www.". If no www prefix, "www." is added. An empty input host remains empty.

  • "if_no_subdomain": If the host is a bare registered domain (e.g., "example.com"), "www." is added. If the host already has a "www." or "www[number]." prefix, it is normalized to "www." (e.g., "www1.example.com" becomes "www.example.com"; "www1.sub.example.com" becomes "www.sub.example.com"). If a non-www subdomain exists (e.g., "sub.example.com" or the normalized "www.sub.example.com"), the host is not further altered. An empty input host remains empty.
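
As a rough illustration of the prefix rules, the sketch below covers "none", "strip", and "keep" on a bare host string. `apply_www()` is a hypothetical helper, not a package function; "if_no_subdomain" is left out because it needs the registered domain, which in practice requires a Public Suffix List lookup.

```r
# Hypothetical helper, for illustration only: mimics three of the
# documented www_handling options on a host string.
apply_www <- function(host, www_handling = "none") {
  # Matches "www." as well as numbered variants such as "www2.".
  bare <- sub("^www[0-9]*\\.", "", host, ignore.case = TRUE)
  switch(
    www_handling,
    none  = host,
    strip = bare,
    # An empty host stays empty; otherwise a plain "www." is enforced.
    keep  = ifelse(nzchar(bare), paste0("www.", bare), bare),
    stop("unknown www_handling: ", www_handling)
  )
}

apply_www("www2.example.com", "strip")
apply_www("www2.example.com", "keep")
apply_www("example.com", "keep")
```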

source

Which section of the Public Suffix List (PSL) to consult: "all", "private", or "icann". Subdomain trimming depends on which section is consulted, so pass source = "icann" to exclude private suffixes (e.g. github.io).

case_handling

A character string specifying how to handle the case of the cleaned URL. Defaults to "lower_host", the RFC 3986 §6.2.2.1 normalization (scheme and host are case-insensitive and folded to lowercase; the path is case-sensitive and preserved).

  • "lower_host": (Default) Lowercases scheme and host only; the path keeps its original casing.

  • "keep": Preserves casing of the reconstructed URL.

  • "lower": Converts the entire cleaned URL, including the path, to lowercase.

  • "upper": Converts the entire cleaned URL, including the path, to uppercase.
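
The "lower_host" rule can be sketched in base R by splitting the string at the first slash after the authority and lowercasing only what comes before it. `fold_scheme_and_host()` is a hypothetical name used for this illustration; the package itself works on parsed components, so treat this as a simplified picture.

```r
# Hypothetical helper, for illustration only: lowercases the scheme and
# host while leaving the path untouched.
fold_scheme_and_host <- function(url) {
  # Optional "scheme://" followed by everything up to the first "/".
  prefix_re <- "^([A-Za-z][A-Za-z0-9+.-]*://)?[^/]*"
  prefix <- regmatches(url, regexpr(prefix_re, url, perl = TRUE))
  rest <- substring(url, nchar(prefix) + 1L)
  paste0(tolower(prefix), rest)
}

fold_scheme_and_host("HTTP://Example.COM/Path")
```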

trailing_slash_handling

A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none".

  • "none": (Default) No specific handling is applied; the path is left as it is after initial parsing.

  • "keep": Ensures the path ends with a slash. If a path exists and doesn't end with one, a slash is appended. A path of just "/" is left unchanged.

  • "strip": Removes a trailing slash if present, unless the path is solely "/".
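
A minimal base R sketch of these three options, applied to a path string, might look as follows. `apply_trailing_slash()` is a hypothetical helper; leaving an empty path empty under "keep" is an assumption of this sketch, not documented behaviour.

```r
# Hypothetical helper, for illustration only: mimics the documented
# trailing_slash_handling options on a path string.
apply_trailing_slash <- function(path, trailing_slash_handling = "none") {
  switch(
    trailing_slash_handling,
    none  = path,
    # Append a slash only when a path exists and lacks one.
    keep  = ifelse(nzchar(path) & !endsWith(path, "/"),
                   paste0(path, "/"), path),
    # Remove one trailing slash, but never reduce "/" to an empty path.
    strip = ifelse(path == "/", path, sub("/$", "", path)),
    stop("unknown trailing_slash_handling: ", trailing_slash_handling)
  )
}

apply_trailing_slash("/Path", "keep")
apply_trailing_slash("/Path/", "strip")
apply_trailing_slash("/", "strip")
```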

index_page_handling

A character string specifying how to handle index/default pages. Defaults to "keep".

  • "keep": (Default) Leave index/default page segments untouched.

  • "strip": Remove a trailing index.* or default.* segment (case-insensitive).
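
One way to picture the "strip" rule is a single case-insensitive substitution on the last path segment. `strip_index_page()` is a hypothetical helper, and the choice to leave a trailing slash where the index segment used to be is an assumption of this sketch.

```r
# Hypothetical helper, for illustration only: drops a final "index.*" or
# "default.*" segment, ignoring case, and keeps the preceding slash.
strip_index_page <- function(path) {
  sub("/(index|default)\\.[^/]*$", "/", path, ignore.case = TRUE)
}

strip_index_page("/blog/index.html")
strip_index_page("/Default.aspx")
strip_index_page("/blog/indexing.html")  # not an index page, left as is
```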

path_normalization

How to normalize path structure. Defaults to "none".

  • "none": (Default) No normalization.

  • "collapse_slashes": Collapse duplicate slashes in the path.

  • "dot_segments": Resolve . and .. segments per RFC 3986.

  • "both": Apply both collapse_slashes and dot_segments.
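
The two normalizations can be sketched in base R as below. The helper names are hypothetical, the dot-segment routine is a simplified take on RFC 3986 section 5.2.4 that handles a single absolute path only, and applying slash collapsing before dot-segment removal for "both" is an assumption rather than documented behaviour.

```r
# Hypothetical helpers, for illustration only.
collapse_slashes <- function(path) gsub("/{2,}", "/", path)

remove_dot_segments <- function(path) {
  # This sketch only handles a single absolute path.
  if (!startsWith(path, "/")) return(path)
  # Paths ending in "/", "/." or "/.." resolve to a directory.
  trailing <- grepl("/\\.{0,2}$", path)
  segs <- strsplit(path, "/", fixed = TRUE)[[1]]
  out <- character(0)
  for (seg in segs[-1]) {            # segs[1] is the empty lead segment
    if (seg == ".") next
    if (seg == "..") {
      if (length(out) > 0) out <- out[-length(out)]
    } else {
      out <- c(out, seg)
    }
  }
  if (length(out) == 0) return("/")
  paste0("/", paste(out, collapse = "/"), if (trailing) "/" else "")
}

normalize_path <- function(path, path_normalization = "none") {
  switch(
    path_normalization,
    none             = path,
    collapse_slashes = collapse_slashes(path),
    dot_segments     = remove_dot_segments(path),
    both             = remove_dot_segments(collapse_slashes(path)),
    stop("unknown path_normalization: ", path_normalization)
  )
}

normalize_path("/a//b/../c/", "both")
```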

scheme_relative_handling

How to handle URLs starting with "//". Defaults to "keep".

  • "keep": Parse using http but return scheme as NA and set status to "ok-scheme-relative".

  • "http": Assume http for parsing and output.

  • "https": Assume https for parsing and output.

  • "error": Treat scheme-relative URLs as invalid.

subdomain_levels_to_keep

An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by `www_handling`.

  • `NULL`: (Default) No specific subdomain stripping is performed beyond `www_handling`.

  • `0`: All subdomains are stripped. If `www_handling` preserved or added 'www.', it remains (e.g., 'www.sub.example.com' becomes 'www.example.com'; 'sub.example.com' becomes 'example.com').

  • `N > 0`: Keeps up to N levels of subdomains, counted from right-to-left (closest to the registered domain), in addition to any 'www.' prefix. E.g., if N=1, 'three.two.one.example.com' becomes 'one.example.com'; 'www.three.two.one.example.com' (post www_handling) becomes 'www.one.example.com'.
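
The counting rule can be illustrated with a base R sketch. `keep_subdomain_levels()` is a hypothetical helper, and it assumes the registered domain is simply the last two labels (the `registered_labels` parameter is also invented for this sketch); the real function consults the Public Suffix List, so hosts under suffixes such as co.uk would need a different value.

```r
# Hypothetical helper, for illustration only: keeps up to n subdomain
# levels next to the registered domain, plus any leading www label.
keep_subdomain_levels <- function(host, n = NULL, registered_labels = 2L) {
  if (is.null(n)) return(host)
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  # Set a leading www / www[number] label aside so it is not counted.
  has_www <- length(labels) > registered_labels &&
    grepl("^www[0-9]*$", labels[1], ignore.case = TRUE)
  www_label <- if (has_www) labels[1] else character(0)
  if (has_www) labels <- labels[-1]
  # Trim from the left so the labels closest to the domain survive.
  if (length(labels) - registered_labels > n) {
    labels <- utils::tail(labels, registered_labels + n)
  }
  paste(c(www_label, labels), collapse = ".")
}

keep_subdomain_levels("three.two.one.example.com", 1)
keep_subdomain_levels("www.three.two.one.example.com", 1)
keep_subdomain_levels("www.sub.example.com", 0)
```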

host_encoding

How to present the host in `clean_url`. Defaults to "keep".

  • "keep": Leave host as parsed by curl (may preserve original case).

  • "idna": Convert Unicode host labels to Punycode (IDNA) for the cleaned URL.

  • "unicode": Decode Punycode labels to Unicode for the cleaned URL.

path_encoding

How to handle percent-encoding in the path for `clean_url`. Defaults to "keep".

  • "keep": Leave the path percent-encoding untouched.

  • "encode": Normalize by decoding first, then percent-encoding each segment (slashes preserved).

  • "decode": Percent-decode UTF-8 sequences in the path.
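
The "encode" option (decode first, then re-encode each segment) can be approximated with `utils::URLdecode()` and `utils::URLencode()`. `encode_path()` is a hypothetical helper for a single ASCII path; how the package treats non-ASCII bytes or malformed percent sequences is not covered by this sketch.

```r
# Hypothetical helper, for illustration only: normalises percent-encoding
# segment by segment so that slashes are preserved.
encode_path <- function(path) {
  segs <- strsplit(path, "/", fixed = TRUE)[[1]]
  enc <- function(s) {
    if (!nzchar(s)) return(s)
    # Decode first so already-encoded input is not encoded twice;
    # repeated = TRUE forces encoding even if the text looks encoded.
    utils::URLencode(utils::URLdecode(s), reserved = TRUE, repeated = TRUE)
  }
  segs <- vapply(segs, enc, character(1), USE.NAMES = FALSE)
  out <- paste(segs, collapse = "/")
  # strsplit() drops a trailing empty segment, so restore the slash.
  if (endsWith(path, "/")) out <- paste0(out, "/")
  out
}

encode_path("/a b/c%20d")      # both spaces end up as %20
utils::URLdecode("/a%20b")     # the "decode" direction
```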

Value

A character vector of cleaned URLs. As the examples show, the elements are named by the corresponding input URLs.

Examples

get_clean_url("Example.COM/Path") # Default lower_host: host folds, path kept
#>          Example.COM/Path 
#> "http://example.com/Path" 
get_clean_url(
  "Example.COM/Path",
  case_handling = "keep",
  trailing_slash_handling = "keep"
)
#>           Example.COM/Path 
#> "http://Example.COM/Path/" 
get_clean_url(
  "Example.COM/Path/",
  case_handling = "upper",
  trailing_slash_handling = "strip"
)
#>         Example.COM/Path/ 
#> "HTTP://EXAMPLE.COM/PATH" 
get_clean_url("http://example.com", www_handling = "strip")
#>    http://example.com 
#> "http://example.com/" 
get_clean_url(
  "http://deep.sub.domain.example.com/path",
  subdomain_levels_to_keep = 0
)
#> http://deep.sub.domain.example.com/path 
#>               "http://example.com/path" 
# -> "http://example.com/path"
get_clean_url(
  "http://www.deep.sub.domain.example.com/path",
  subdomain_levels_to_keep = 1,
  www_handling = "strip"
)
#> http://www.deep.sub.domain.example.com/path 
#>            "http://domain.example.com/path" 
# -> "http://domain.example.com/path"
get_clean_url(
  "http://www.deep.sub.domain.example.com/path",
  subdomain_levels_to_keep = 1,
  www_handling = "keep"
)
#> http://www.deep.sub.domain.example.com/path 
#>        "http://www.domain.example.com/path" 
# -> "http://www.domain.example.com/path"