Lookaround with optional characters

Time:09-20

I am trying to extract domains from a dataframe of urls, essentially removing the scheme and prefix (i.e., http, https, www) and the top-level domain (i.e., com, org, gov). I currently do this in a step-wise manner (see v1-v3) and believe I should be able to do it in a single step using lookarounds with optional characters, but my best attempt so far, v4, still does not match v3.

In pseudocode, I would like to: extract the part of the string preceded by http/s and/or www, and followed by com/org/io/gov.

Thank you for your help!

library(dplyr)
library(stringr)

tibble::tribble(
  ~url,
  "http://example.org",
  "https://www.words.com/",
  "http://www.potato.io/some/more/text",
  "www.apple.sauce.gov/"
) |>
  mutate(
    v1 = str_remove(url, "^https?://"),
    v2 = str_remove(v1, "^www\\."),
    v3 = str_remove(v2, "\\.(com|org|io|gov).*")
  ) |>
  mutate(
    v4 = str_extract(url, "(?<=https?://(www\\.)?).*(?=\\.(com|org|io|gov))")
  ) |> 
  knitr::kable()
|url                                 |v1                           |v2                       |v3          |v4         |
|:-----------------------------------|:----------------------------|:------------------------|:-----------|:----------|
|http://example.org                  |example.org                  |example.org              |example     |example    |
|https://www.words.com/              |www.words.com/               |words.com/               |words       |www.words  |
|http://www.potato.io/some/more/text |www.potato.io/some/more/text |potato.io/some/more/text |potato      |www.potato |
|www.apple.sauce.gov/                |www.apple.sauce.gov/         |apple.sauce.gov/         |apple.sauce |NA         |

Created on 2022-09-16 by the reprex package (v2.0.1)

CodePudding user response:

1) This can be done without lookarounds. From the start of the string, match the optional prefix, then everything up to the last dot, then the rest. Capture the part up to the last dot and return it. (x is defined in the Note at the end.)

sub("^(https?://)?(www\\.)?(.*)\\..*$", "\\3", x)    
## [1] "example"     "words"       "potato"      "apple.sauce"
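One caveat with (1): because it splits on the last dot in the whole string, a dot in the path changes the result (for example, "http://www.potato.io/index.html" would give "potato.io/index"). A minimal sketch of a stricter variant follows, assuming the only top-level domains of interest are the four listed in the question. The names tlds, pattern and strip_url are hypothetical, and perl = TRUE is used so that the lazy quantifier and non-capturing groups are handled by PCRE.

```r
# Sample input, copied from the Note at the end of this answer.
x <- c("http://example.org", "https://www.words.com/",
  "http://www.potato.io/some/more/text", "www.apple.sauce.gov/")

# Assumption: the accepted top-level domains are the ones named in the question.
tlds <- c("com", "org", "io", "gov")

# Optional scheme, optional "www.", a lazily captured name, then one of the
# listed TLDs, which must be followed by a "/..." path or the end of the string.
pattern <- paste0(
  "^(?:https?://)?(?:www\\.)?(.*?)\\.(?:",
  paste(tlds, collapse = "|"),
  ")(?:/.*)?$"
)

# Hypothetical helper: replace the whole URL by the captured name.
strip_url <- function(u) sub(pattern, "\\1", u, perl = TRUE)

strip_url(x)
## [1] "example"     "words"       "potato"      "apple.sauce"

strip_url("http://www.potato.io/index.html")
## [1] "potato"
```

Strings that do not match the pattern at all (for instance, a URL with a TLD outside the list) are returned unchanged by sub, so they may need separate handling.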

2) The urltools package can split the character strings giving a data frame with subdomain, domain and other columns. Use coalesce to replace NA's in subdomain with an empty string, paste together the processed subdomain and domain and finally remove any www. or . prefix.

library(dplyr)
library(urltools)

x %>%
  domain %>%
  suffix_extract %>%
  with(paste0(coalesce(subdomain, ""), ".", domain)) %>%
  sub("^(www)?\\.?", "", .)
## [1] "example"     "words"       "potato"      "apple.sauce"

3) We can simplify (1) by making use of url_parse in the urltools package and file_path_sans_ext in the tools package. The tools package comes with R, so only urltools has to be installed.

library(tools)
library(urltools)

sub("^(www)?\\.", "", file_path_sans_ext(url_parse(x)$domain))
## [1] "example"     "words"       "potato"      "apple.sauce"

This could also be written as a pipeline:

x |>
  url_parse() |>
  with(domain) |>
  file_path_sans_ext() |>
  sub("^(www)?\\.", "", x = _)
## [1] "example"     "words"       "potato"      "apple.sauce"

Note

x <- c("http://example.org", "https://www.words.com/",
  "http://www.potato.io/some/more/text", "www.apple.sauce.gov/")
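Since the question uses stringr, the capture-group idea from (1) can probably be carried over directly with str_match. As far as I can tell, v4 goes wrong for two reasons: the lookbehind is already satisfied right after "://" with the optional (www\\.)? matching nothing, so the leftmost match still includes "www."; and the lookbehind requires http(s)://, so the last URL gives NA. The sketch below avoids lookbehind altogether. The vector name url mirrors the question's column, pattern is a hypothetical name, and the TLD list is assumed to be the four from the question.

```r
library(stringr)

# Same sample URLs as in the question.
url <- c("http://example.org", "https://www.words.com/",
  "http://www.potato.io/some/more/text", "www.apple.sauce.gov/")

# Optional scheme and optional "www." are matched but not captured; the lazy
# group captures the name; the TLD must be followed by "/" or end of string.
pattern <- "^(?:https?://)?(?:www\\.)?(.*?)\\.(?:com|org|io|gov)(?:/|$)"

# str_match() returns a character matrix: column 1 holds the whole match,
# column 2 holds the first capture group.
v4 <- str_match(url, pattern)[, 2]
v4
## [1] "example"     "words"       "potato"      "apple.sauce"
```

Inside the original pipeline this would read mutate(v4 = str_match(url, pattern)[, 2]). Rows that do not match the pattern come back as NA rather than unchanged.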