Splits a URL-shaped string into coarse components with a hand-rolled splitter, primarily to extract the host for internationalized-domain-name handling, optionally ASCII-encoding it.
Value
An object of class "punycoder_parsed_url" (a named list)
with components:
- scheme
Character vector of URL schemes (e.g., "https").
- domain
Character vector of domain names.
- port
Integer vector of port numbers.
- path
Character vector of URL paths.
- query
Character vector of query strings.
- fragment
Character vector of fragment identifiers.
Each component has one element per input URL. Invalid URLs yield
NA components. For valid URLs without an explicit path,
path is returned as "".
Details
This is **best-effort host extraction, not a conformant URL parser.** It is *not* RFC 3986 / WHATWG URL compliant: there is no percent encoding/decoding, no scheme validation, no robust port/path/query semantics, no full IPv6 (zone IDs / RFC 6874 are unhandled), and no serialization guarantees. The non-host components are returned as a convenience only; for real URL parsing and canonicalization use a dedicated URL package (e.g. `rurl`). This surface is slated for eventual removal in favour of `rurl` consuming punycoder's host functions.
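To make "coarse splitting" concrete, here is a minimal base-R sketch built on the generic URI regular expression from RFC 3986, Appendix B. It is not punycoder's implementation: the helper name `split_url_sketch()` is hypothetical, it handles a single string, and it shares the limitations listed above (no percent decoding, no userinfo handling, no validation).

```r
# Hypothetical helper, not part of punycoder: coarse split of ONE URL string
# using the generic URI regex from RFC 3986, Appendix B.
split_url_sketch <- function(url) {
  pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"
  # m[1] is the full match; m[2] to m[10] are capture groups 1 to 9.
  m <- regmatches(url, regexec(pattern, url))[[1]]
  authority <- m[5]
  # Treat a trailing ":digits" as the port; userinfo ("user@") is not handled.
  has_port <- grepl(":[0-9]+$", authority)
  list(
    scheme   = m[3],
    domain   = sub(":[0-9]+$", "", authority),
    port     = if (has_port) as.integer(sub("^.*:", "", authority)) else NA_integer_,
    path     = m[6],
    query    = m[8],
    fragment = m[10]
  )
}

parts <- split_url_sketch(
  "https://caf\u00E9.example.com:8080/path?query=value#fragment"
)
parts$domain  # the Unicode host, still unencoded
```

Only the `domain` element matters for IDN handling; the other elements are whatever the regex happened to capture, which is why they should not be treated as parsed URL components.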
Deprecated
This function is deprecated and slated for removal in a future release. For URL parsing and canonicalization use a dedicated URL package (e.g. `rurl`); for host-only encoding pass the host alone to [host_normalize()] or [puny_encode()].
See also
url_encode, url_decode for URL
transformation with IDN handling.
Examples
# \donttest{
# Parse URL with Unicode domain
parse_url(
"https://caf\u00E9.example.com:8080/path?query=value#fragment"
)
#> Warning: 'parse_url()' is deprecated and will be removed in a future release.
#> Use the 'rurl' package for URL parsing/canonicalization, or host_normalize() / puny_encode() for host-only encoding.
#> Punycoder Parsed URL Results
#> ============================
#>
#> URL 1 :
#> Scheme: https
#> Domain: café.example.com
#> Port: 8080
#> Path: /path
#> Query: query=value
#> Fragment: fragment
#>
# Parse multiple URLs
urls <- c(
"https://caf\u00E9.com/menu",
"https://\u043C\u043E\u0441\u043A\u0432\u0430.\u0440\u0444/info"
)
parse_url(urls)
#> Warning: 'parse_url()' is deprecated and will be removed in a future release.
#> Use the 'rurl' package for URL parsing/canonicalization, or host_normalize() / puny_encode() for host-only encoding.
#> Punycoder Parsed URL Results
#> ============================
#>
#> URL 1 :
#> Scheme: https
#> Domain: café.com
#> Path: /menu
#>
#> URL 2 :
#> Scheme: https
#> Domain: москва.рф
#> Path: /info
#>
# }