I am trying to extract domains from a data frame of URLs, essentially removing the scheme (http or https), an optional www. prefix, and the top-level domain (e.g., com, org, gov). I currently do this step by step (see v1 to v3) and believe I should be able to do it in a single step using lookarounds with optional characters, but my best attempt so far, v4, still does not match v3.
In pseudocode, I would like to: extract the part of the string preceded by http/s and/or www, and followed by com/org/io/gov.
Thank you for your help!
library(dplyr)
library(stringr)
tibble::tribble(
~url,
"http://example.org",
"https://www.words.com/",
"http://www.potato.io/some/more/text",
"www.apple.sauce.gov/"
) |>
mutate(
v1 = str_remove(url, "^https?://"),
v2 = str_remove(v1, "^www\\."),
v3 = str_remove(v2, "\\.(com|org|io|gov).*")
) |>
mutate(
v4 = str_extract(url, "(?<=https?://(www\\.)?).*(?=\\.(com|org|io|gov))")
) |>
knitr::kable()
| url | v1 | v2 | v3 | v4 |
|---|---|---|---|---|
| http://example.org | example.org | example.org | example | example |
| https://www.words.com/ | www.words.com/ | words.com/ | words | www.words |
| http://www.potato.io/some/more/text | www.potato.io/some/more/text | potato.io/some/more/text | potato | www.potato |
| www.apple.sauce.gov/ | www.apple.sauce.gov/ | apple.sauce.gov/ | apple.sauce | NA |
Created on 2022-09-16 by the reprex package (v2.0.1)
Answer:
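1) This can be done without lookarounds. Starting at the beginning of the string, match the optional prefix, capture everything up to the last dot, and then match the rest; sub replaces the whole string with the captured group. This assumes the last dot in the string is the one before the top-level domain, which holds for the sample input.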
sub("^(https?://)?(www\\.)?(.*)\\..*$", "\\3", x)
## [1] "example" "words" "potato" "apple.sauce"
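Approach (1) depends on the last dot being the one in front of the top-level domain, so a path such as /index.html would throw it off. A minimal sketch of a stricter variant follows, assuming the set of top-level domains is known in advance. The helper name strip_url and its tlds argument are hypothetical and not part of any package.

```r
# Hypothetical helper: drop the scheme, "www." and a known top-level domain
# (plus anything after it) in a single sub() call.
strip_url <- function(url, tlds = c("com", "org", "io", "gov")) {
  pattern <- paste0(
    "^(?:https?://)?",                           # optional scheme
    "(?:www\\.)?",                               # optional www. prefix
    "(.*?)",                                     # part to keep, as short as possible
    "\\.(?:", paste(tlds, collapse = "|"), ")",  # one of the known top-level domains
    "(?:[/:?#].*)?$"                             # optional port, path, query or fragment
  )
  # perl = TRUE so that the lazy quantifier and non-capturing groups are PCRE syntax
  sub(pattern, "\\1", url, perl = TRUE)
}

x <- c("http://example.org", "https://www.words.com/",
       "http://www.potato.io/some/more/text", "www.apple.sauce.gov/")

strip_url(x)
strip_url("http://www.potato.io/index.html")
```

Strings that do not match the pattern at all are returned unchanged, because sub only replaces what it matches.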
2) The urltools package can split the character strings, giving a data frame with subdomain, domain and other columns. Use coalesce to replace NAs in subdomain with an empty string, paste together the processed subdomain and domain, and finally remove any www. or . prefix.
library(dplyr)
library(urltools)
x %>%
domain %>%
suffix_extract %>%
with(paste0(coalesce(subdomain, ""), ".", domain)) %>%
sub("^(www)?\\.?", "", .)
## [1] "example" "words" "potato" "apple.sauce"
3) We can simplify (1) by making use of url_parse from the urltools package and file_path_sans_ext from the tools package. tools comes with R, so only urltools has to be installed.
library(tools)
library(urltools)
sub("^(www)?\\.", "", file_path_sans_ext(url_parse(x)$domain))
## [1] "example" "words" "potato" "apple.sauce"
This could also be written as a pipeline (the _ placeholder requires R 4.2 or later):
x |>
url_parse() |>
with(domain) |>
file_path_sans_ext() |>
sub("^(www)?\\.", "", x = _)
## [1] "example" "words" "potato" "apple.sauce"
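If an extraction that yields NA for non-matching input is preferred, as in the question's str_extract attempt, the following base R sketch may help. It assumes PCRE (perl = TRUE), where \K discards the already matched prefix and so stands in for a variable-length lookbehind, and a lookahead checks for the top-level domain. The function name extract_name and the fixed list of top-level domains are illustrative assumptions.

```r
# Hypothetical helper: return the part between the optional scheme/www. prefix
# and a known top-level domain, or NA when the pattern does not match.
extract_name <- function(url) {
  pattern <- paste0(
    "^(?:https?://)?(?:www\\.)?",                  # optional prefix, consumed ...
    "\\K",                                         # ... and then dropped from the match
    ".*?",                                         # shortest run of characters
    "(?=\\.(?:com|org|io|gov)(?:[/:?#]|$))"        # followed by a known top-level domain
  )
  m <- regexpr(pattern, url, perl = TRUE)
  out <- rep(NA_character_, length(url))
  out[m > 0] <- regmatches(url, m)                 # regmatches skips non-matches
  out
}

urls <- c("http://example.org", "https://www.words.com/",
          "http://www.potato.io/some/more/text", "www.apple.sauce.gov/")

extract_name(urls)
extract_name("localhost")
```

Anchoring the pattern at the start of the string matters here: without it, the match could begin before the scheme and include it in the result.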
Note: the input x used above is defined as follows.
x <- c("http://example.org", "https://www.words.com/",
"http://www.potato.io/some/more/text", "www.apple.sauce.gov/")