
Splits URL-shaped strings into coarse components using a hand-rolled splitter. Its main purpose is to extract the host for internationalized domain name (IDN) handling; the host can optionally be encoded to ASCII.

Usage

parse_url(url, encode_domains = FALSE)

Arguments

url

Character vector of URL-shaped strings to split.

encode_domains

Logical flag; if TRUE, parsed host names are encoded to ASCII. Defaults to FALSE.

Value

An object of class "punycoder_parsed_url" (a named list) with components:

scheme

Character vector of URL schemes (e.g., "https").

domain

Character vector of domain names.

port

Integer vector of port numbers.

path

Character vector of URL paths.

query

Character vector of query strings.

fragment

Character vector of fragment identifiers.

Each component has one element per input URL. Invalid URLs yield NA components. For valid URLs without an explicit path, path is returned as "".

Details

This is **best-effort host extraction, not a conformant URL parser.** It is *not* RFC 3986 / WHATWG URL compliant: there is no percent encoding/decoding, no scheme validation, no robust port/path/query semantics, no full IPv6 (zone IDs / RFC 6874 are unhandled), and no serialization guarantees. The non-host components are returned as a convenience only; for real URL parsing and canonicalization use a dedicated URL package (e.g. `rurl`). This surface is slated for eventual removal in favour of `rurl` consuming punycoder's host functions.
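To make "coarse components" concrete, here is a minimal base-R sketch of a regex-based splitter. It is not punycoder's implementation: the function name `split_url_sketch()` is hypothetical, the pattern is the generic URI expression from RFC 3986, Appendix B, and the sketch shares the limitations listed above (no percent decoding, no validation, no userinfo or IPv6 zone handling).

```r
# Illustrative base-R splitter; this is NOT punycoder's implementation.
split_url_sketch <- function(url) {
  # Generic URI regular expression from RFC 3986, Appendix B.
  pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?$"
  parts <- regmatches(url, regexec(pattern, url))

  # Element 1 is the full match, so capture group k sits at index k + 1.
  group <- function(k) vapply(parts, function(p) p[k + 1L], character(1))

  # The authority is "host" or "host:port"; split a trailing numeric port off.
  authority <- group(4)
  has_port <- grepl(":[0-9]+$", authority)
  port <- rep(NA_integer_, length(url))
  port[has_port] <- as.integer(sub("^.*:([0-9]+)$", "\\1", authority[has_port]))

  list(
    scheme   = group(2),
    host     = sub(":[0-9]+$", "", authority),
    port     = port,
    path     = group(5),
    query    = group(7),
    fragment = group(9)
  )
}

res <- split_url_sketch(c(
  "https://caf\u00E9.example.com:8080/path?query=value#fragment",
  "https://caf\u00E9.com/menu"
))
str(res)
```

Only the host element of such a result matters for IDN work; the remaining elements are unvalidated substrings, which is why a dedicated URL package is the better choice for anything beyond host extraction.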

Deprecated

This function is deprecated and slated for removal in a future release. For URL parsing and canonicalization, use a dedicated URL package (e.g. `rurl`); for host-only encoding, pass the host alone to `host_normalize()` or `puny_encode()`.
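A hedged migration sketch for the host-only case. It assumes the package is installed under the name punycoder and that `puny_encode()` accepts the host as its first argument; the full signature is not shown on this page, so check the function's own help topic before relying on it. The call is skipped when the package is unavailable.

```r
# The caller supplies the host directly instead of extracting it
# from a full URL with parse_url().
host <- "caf\u00E9.example.com"

if (requireNamespace("punycoder", quietly = TRUE)) {
  # Assumption: puny_encode() takes the host as its first argument.
  ascii_host <- punycoder::puny_encode(host)
  print(ascii_host)
}
```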

See also

url_encode() and url_decode() for URL transformation with IDN handling.

Examples

# \donttest{
# Parse URL with Unicode domain
parse_url(
  "https://caf\u00E9.example.com:8080/path?query=value#fragment"
)
#> Warning: 'parse_url()' is deprecated and will be removed in a future release.
#> Use the 'rurl' package for URL parsing/canonicalization, or host_normalize() / puny_encode() for host-only encoding.
#> Punycoder Parsed URL Results
#> ============================
#> 
#> URL 1 :
#>   Scheme:   https 
#>   Domain:   café.example.com 
#>   Port:     8080 
#>   Path:     /path 
#>   Query:    query=value 
#>   Fragment: fragment 
#> 

# Parse multiple URLs
urls <- c(
  "https://caf\u00E9.com/menu",
  "https://\u043C\u043E\u0441\u043A\u0432\u0430.\u0440\u0444/info"
)
parse_url(urls)
#> Warning: 'parse_url()' is deprecated and will be removed in a future release.
#> Use the 'rurl' package for URL parsing/canonicalization, or host_normalize() / puny_encode() for host-only encoding.
#> Punycoder Parsed URL Results
#> ============================
#> 
#> URL 1 :
#>   Scheme:   https 
#>   Domain:   café.com 
#>   Path:     /menu 
#> 
#> URL 2 :
#>   Scheme:   https 
#>   Domain:   москва.рф 
#>   Path:     /info 
#> 
# }