Vectorized wrapper around safe_parse_url that returns a
data.frame with one row per input URL.
Usage
safe_parse_urls(
url,
protocol_handling = c("keep", "none", "strip", "http", "https"),
www_handling = c("none", "strip", "keep", "if_no_subdomain"),
tld_source = c("all", "private", "icann"),
case_handling = c("lower_host", "keep", "lower", "upper"),
trailing_slash_handling = c("none", "keep", "strip"),
index_page_handling = c("keep", "strip"),
path_normalization = c("none", "collapse_slashes", "dot_segments", "both"),
scheme_relative_handling = c("keep", "http", "https", "error"),
subdomain_levels_to_keep = NULL,
host_encoding = c("keep", "idna", "unicode"),
path_encoding = c("keep", "encode", "decode")
)
Arguments
- url
A character vector of URLs to be parsed.
- protocol_handling
A character string specifying how to handle protocols. Defaults to "keep".
"keep": If a scheme exists (http, https, ftp, ftps), it's used. If no scheme, "http://" is added.
"none": If a scheme exists, it's used. If no scheme, then no scheme is used (scheme component will be NA).
"strip": Any existing scheme is removed (scheme component will be NA).
"http": The scheme is forced to be "http".
"https": The scheme is forced to be "https".
- www_handling
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
"none": (Default) Leaves the host's www prefix (or lack thereof) untouched.
"strip": Removes any "www." or "www[number]." prefix.
"keep": Ensures the host starts with "www.". If it has "www[number].", it's normalized to "www.". If no www prefix, "www." is added. An empty input host remains empty.
"if_no_subdomain": If the host is a bare registered domain (e.g., "example.com"), "www." is added. If the host already has a "www." or "www[number]." prefix, it is normalized to "www." (e.g., "www1.example.com" becomes "www.example.com"; "www1.sub.example.com" becomes "www.sub.example.com"). If a non-www subdomain exists (e.g., "sub.example.com" or the normalized "www.sub.example.com"), the host is not further altered. An empty input host remains empty.
- tld_source
Which TLD source to use for TLD extraction: "all", "icann", or "private". Defaults to "all".
- case_handling
A character string specifying how to handle the case of the cleaned URL. Defaults to "lower_host", the RFC 3986 §6.2.2.1 normalization (scheme and host are case-insensitive and folded to lowercase; the path is case-sensitive and preserved).
"lower_host": (Default) Lowercases scheme and host only; the path keeps its original casing.
"keep": Preserves casing of the reconstructed URL.
"lower": Converts the cleaned URL to lowercase.
"upper": Converts the cleaned URL to uppercase.
- trailing_slash_handling
A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none".
"none": (Default) No specific handling is applied. Path remains as is after initial parsing.
"keep": Ensures a trailing slash. If a path exists and doesn't end with one, it's added. If path is just "/", it's kept.
"strip": Removes a trailing slash if present, unless the path is solely "/".
- index_page_handling
A character string specifying how to handle index/default pages. Defaults to "keep".
"keep": (Default) Leave index/default page segments untouched.
"strip": Remove a trailing index.* or default.* segment (case-insensitive).
- path_normalization
How to normalize path structure. Defaults to "none".
"none": (Default) No normalization.
"collapse_slashes": Collapse duplicate slashes in the path.
"dot_segments": Resolve . and .. segments per RFC 3986.
"both": Apply both collapse_slashes and dot_segments.
- scheme_relative_handling
How to handle URLs starting with "//". Defaults to "keep".
"keep": Parse using http but return scheme as NA and set status to "ok-scheme-relative".
"http": Assume http for parsing and output.
"https": Assume https for parsing and output.
"error": Treat scheme-relative URLs as invalid.
- subdomain_levels_to_keep
An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by `www_handling`.
`NULL`: (Default) No specific subdomain stripping is performed beyond `www_handling`.
`0`: All subdomains are stripped. If `www_handling` preserved or added 'www.', it remains (e.g., 'www.sub.example.com' becomes 'www.example.com'; 'sub.example.com' becomes 'example.com').
`N > 0`: Keeps up to N levels of subdomains, counted from right-to-left (closest to the registered domain), in addition to any 'www.' prefix. E.g., if N=1, 'three.two.one.example.com' becomes 'one.example.com'; 'www.three.two.one.example.com' (post www_handling) becomes 'www.one.example.com'.
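The counting rule can be sketched as follows. `keep_subdomain_levels` is a hypothetical helper, and it takes the registered domain as an explicit argument, since deriving it requires a TLD source. It also assumes `www_handling` has already been applied, so any www prefix is a plain "www.".

```r
# Illustrative sketch only: a hypothetical helper, not the package's code.
# Assumes the registered domain is known and any www prefix is a plain "www.".
keep_subdomain_levels <- function(host, registered_domain, n = NULL) {
  if (is.null(n)) return(host)
  has_www <- startsWith(host, "www.")
  rest <- if (has_www) substring(host, 5) else host
  # Everything to the left of the registered domain, minus the joining dot
  sub_part <- substr(rest, 1, nchar(rest) - nchar(registered_domain))
  labels <- strsplit(sub("\\.$", "", sub_part), ".", fixed = TRUE)[[1]]
  # Keep the n labels closest to the registered domain
  kept <- utils::tail(labels, n)
  paste(c(if (has_www) "www", kept, registered_domain), collapse = ".")
}

keep_subdomain_levels("three.two.one.example.com", "example.com", 1)
keep_subdomain_levels("www.sub.example.com", "example.com", 0)
```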
- host_encoding
How to present the host in `clean_url`. Defaults to "keep".
"keep": Leave host as parsed by curl (may preserve original case).
"idna": Convert Unicode host labels to Punycode (IDNA) for the cleaned URL.
"unicode": Decode Punycode labels to Unicode for the cleaned URL.
- path_encoding
How to handle percent-encoding in the path for `clean_url`. Defaults to "keep".
"keep": Leave the path percent-encoding untouched.
"encode": Normalize by decoding first, then percent-encoding each segment (slashes preserved).
"decode": Percent-decode UTF-8 sequences in the path.
Value
A data.frame with one row per URL and the same fields returned by
safe_parse_url. Invalid inputs return NA fields with
parse_status = "error".
Examples
safe_parse_urls(c("example.com", "https://www.example.com/path"))
#>                   original_url scheme            host port  path query fragment
#> 1                  example.com   http     example.com   NA     /  <NA>     <NA>
#> 2 https://www.example.com/path  https www.example.com   NA /path  <NA>     <NA>
#>   user password      domain tld is_ip_host                    clean_url
#> 1 <NA>     <NA> example.com com      FALSE          http://example.com/
#> 2 <NA>     <NA> example.com com      FALSE https://www.example.com/path
#>   parse_status
#> 1           ok
#> 2           ok