Lookaround with optional characters

I am trying to extract domains from a data frame of URLs, essentially removing the scheme and prefix (i.e., http, https, www) and the top-level domain (i.e., com, org, gov). I currently do this step-wise (see v1-v3) and believe I should be able to do it in a single step using lookarounds with optional characters, but my best attempt so far, v4, still does not match v3.

In pseudocode, I would like to: extract the part of the string preceded by http/s and/or www, and followed by com/org/io/gov.

Thank you for your help!

library(dplyr)
library(stringr)

tibble::tribble(
  ~url,
  "http://example.org",
  "https://www.words.com/",
  "http://www.potato.io/some/more/text",
  "www.apple.sauce.gov/"
) |>
  mutate(
    v1 = str_remove(url, "^https?://"),
    v2 = str_remove(v1, "^www\\."),
    v3 = str_remove(v2, "\\.(com|org|io|gov).*")
  ) |>
  mutate(
    v4 = str_extract(url, "(?<=https?://(www\\.)?).*(?=\\.(com|org|io|gov))")
  ) |> 
  knitr::kable()

|url                                 |v1                           |v2                       |v3          |v4         |
|:-----------------------------------|:----------------------------|:------------------------|:-----------|:----------|
|http://example.org                  |example.org                  |example.org              |example     |example    |
|https://www.words.com/              |www.words.com/               |words.com/               |words       |www.words  |
|http://www.potato.io/some/more/text |www.potato.io/some/more/text |potato.io/some/more/text |potato      |www.potato |
|www.apple.sauce.gov/                |www.apple.sauce.gov/         |apple.sauce.gov/         |apple.sauce |NA         |

Created on 2022-09-16 by the reprex package (v2.0.1)

CodePudding user response：

1) This can be done without lookarounds. Anchor the pattern at the start of the string, optionally match the scheme and the www. prefix, then capture greedily everything up to the last dot and match the rest. The replacement returns only the captured part.

sub("^(https?://)?(www\\.)?(.*)\\..*$", "\\3", x)    
## [1] "example"     "words"       "potato"      "apple.sauce"
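
One caveat with the pattern in (1): because (.*) is greedy, it runs to the last dot in the whole string, so a URL whose path contains a dot would keep part of the path. A minimal sketch of a variant that confines the match to the host, assuming the host never contains a slash; the helper name extract_name and the index.html URL are made up for illustration:

```r
# Hypothetical helper (not part of the original answer): same idea as (1),
# but [^/]* keeps the capture inside the host, and the optional (/.*)?
# swallows any path after the top-level domain.
extract_name <- function(x) {
  sub("^(https?://)?(www\\.)?([^/]*)\\.[^./]*(/.*)?$", "\\3", x)
}

x <- c("http://example.org", "https://www.words.com/",
  "http://www.potato.io/some/more/text", "www.apple.sauce.gov/")

extract_name(x)
## [1] "example"     "words"       "potato"      "apple.sauce"

# A path containing a dot no longer leaks into the result
extract_name("http://example.org/index.html")
## [1] "example"
```

Like (1), this treats everything after the last dot of the host as the top-level domain, so multi-part suffixes such as co.uk would need the suffix-aware approach in (2).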

2) The urltools package can split the strings into a data frame with subdomain, domain and other columns. Use coalesce to replace NAs in subdomain with an empty string, paste the processed subdomain and domain together, and finally remove any leading www. or . prefix.

library(dplyr)
library(urltools)

x %>%
  domain %>%
  suffix_extract %>%
  with(paste0(coalesce(subdomain, ""), ".", domain)) %>%
  sub("^(www)?\\.?", "", .)
## [1] "example"     "words"       "potato"      "apple.sauce"
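
To make the paste-and-strip step in (2) easier to follow, here is a base R sketch of just that part. The parts data frame is written by hand as a stand-in for what suffix_extract is assumed to return for these four hosts (it is not actual urltools output), and ifelse with is.na plays the role of coalesce:

```r
# Hand-written stand-in for the subdomain/domain columns (an assumption,
# not real suffix_extract output)
parts <- data.frame(
  subdomain = c(NA, "www", "www", "www.apple"),
  domain    = c("example", "words", "potato", "sauce")
)

# Replace NA subdomains with "" and join subdomain and domain with a dot
joined <- paste0(ifelse(is.na(parts$subdomain), "", parts$subdomain),
                 ".", parts$domain)
joined
## [1] ".example"        "www.words"       "www.potato"      "www.apple.sauce"

# Drop a leading "www." or a bare leading "."
result <- sub("^(www)?\\.?", "", joined)
result
## [1] "example"     "words"       "potato"      "apple.sauce"
```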

3) We can simplify (1) by making use of url_parse in the urltools package and file_path_sans_ext in the tools package. tools comes with R, so only urltools has to be installed.

library(tools)
library(urltools)

sub("^(www)?\\.", "", file_path_sans_ext(url_parse(x)$domain))
## [1] "example"     "words"       "potato"      "apple.sauce"

This could also be written as a pipeline:

x |>
  url_parse() |>
  with(domain) |>
  file_path_sans_ext() |>
  sub("^(www)?\\.", "", x = _)
## [1] "example"     "words"       "potato"      "apple.sauce"
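
The `_` placeholder used in the pipeline needs R 4.2 or later. On R 4.1, one possible workaround is to pipe into an anonymous function instead. The sketch below assumes the hosts have already been extracted; the hosts vector is written by hand here rather than produced by url_parse, so the example runs with base R only:

```r
library(tools)

# Hand-written hosts standing in for the domain column (an assumption)
hosts <- c("example.org", "www.words.com", "www.potato.io",
  "www.apple.sauce.gov")

res <- hosts |>
  file_path_sans_ext() |>              # drop the final ".xyz" part
  (\(h) sub("^(www)?\\.", "", h))()    # strip a leading "www."
res
## [1] "example"     "words"       "potato"      "apple.sauce"
```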

Note: the input vector x used in the examples above is:

x <- c("http://example.org", "https://www.words.com/",
  "http://www.potato.io/some/more/text", "www.apple.sauce.gov/")