Match URLs containing a string pattern and save each URL in a new column of an R data frame

Time:11-14

I have a dataset in which one column, 'urls', holds multiple URLs as a single space-separated string:

urls <- "https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx"
id <- 1

df <- cbind(data.frame(urls), data.frame(id))

I now want to extract the complete URL matching "linkedin.com" and store it in a new column, df$linkedin, and do the same for the URL matching "medium.com", storing it in a new column, df$medium. The result would be:

df$linkedin
"https://www.linkedin.com/xx/xxx-xx-xxx/"

df$medium
"https://medium.com/@xxxxx"

I'm having a bad hair day today and can't see an elegant solution. It would be awesome if you could help me out here :)

CodePudding user response:

I'll make this a little more interesting by making it two rows:

df2 <- structure(list(urls = c("https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx", "https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy"), id = c(1, 1)), row.names = c(NA, -2L), class = "data.frame")
df2
#                                                                                 urls id
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx  1
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy  1

base R

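# Substrings that identify the URLs to extract; also used as the new column names.
baseurls <- c("linkedin", "medium")
# For each name, collect every space-delimited token that starts with "http" and contains that name.
newcols <- lapply(setNames(nm = baseurls), function(U) {
  unlist(regmatches(df2$urls, gregexpr(paste0("http[^ ]*", U, "[^ ]*"), df2$urls)))
})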
newcols
# $linkedin
# [1] "https://www.linkedin.com/xx/xxx-xx-xxx/" "https://www.linkedin.com/yy/yyy-yy-yyy/"
# $medium
# [1] "https://medium.com/@xxxxx" "https://medium.com/@yyyyy"
cbind(df2, data.frame(newcols))
#                                                                                 urls id                                linkedin                    medium
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx  1 https://www.linkedin.com/xx/xxx-xx-xxx/ https://medium.com/@xxxxx
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy  1 https://www.linkedin.com/yy/yyy-yy-yyy/ https://medium.com/@yyyyy
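Note that unlist() over the gregexpr() matches only lines up with the rows when every row contains exactly one match per pattern. If a row has no matching URL (or several), the resulting vector has a different length than the data frame and cbind() will fail or misalign. A minimal sketch of one way around this, assuming that only the first match per row is wanted and that NA is acceptable for rows without a match; extract_first_url and df3 are hypothetical names, not part of the original answer:

```r
# Hypothetical helper: first URL in each element of `x` that contains `domain`,
# or NA when the element has no such URL.
extract_first_url <- function(x, domain) {
  pattern <- paste0("http[^ ]*", domain, "[^ ]*")
  m <- regexpr(pattern, x)              # start of the first match, -1 if none
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)        # regmatches() returns matching elements only
  out
}

# Illustrative data: the second row deliberately has no LinkedIn URL.
df3 <- data.frame(
  urls = c(
    "https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx",
    "https//domain.io https://medium.com/@yyyyy"
  ),
  id = c(1, 2),
  stringsAsFactors = FALSE
)

baseurls <- c("linkedin", "medium")
# One new column per name, each the same length as the data frame.
df3[baseurls] <- lapply(baseurls, function(U) extract_first_url(df3$urls, U))
df3
```

Because each column is built with one value per row, the new columns can be assigned directly and no cbind() is needed.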

tidyverse

## baseurls <- ...
library(dplyr)
library(stringr) # str_extract
library(purrr)   # map_dfc
map_dfc(setNames(nm = baseurls), ~ str_extract(df2$urls, paste0("http[^ ]*", .x, "[^ ]*"))) %>%
  bind_cols(df2, .)
#                                                                                 urls id                                linkedin                    medium
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx  1 https://www.linkedin.com/xx/xxx-xx-xxx/ https://medium.com/@xxxxx
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy  1 https://www.linkedin.com/yy/yyy-yy-yyy/ https://medium.com/@yyyyy
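The question asks for matches on "linkedin.com" and "medium.com", while the patterns above use the bare names "linkedin" and "medium". If the full domain is used in the pattern, keep in mind that an unescaped dot is a regex metacharacter and matches any character. A minimal sketch of one way to handle that, assuming stringr is installed; escape_dots, domains, and the example URLs (including the look-alike host) are made up for illustration:

```r
library(stringr)

# Illustrative input; the second row contains a look-alike host on purpose.
urls <- c(
  "https://www.linkedin.com/xx/ https://medium.com/@xxxxx",
  "https://linkedinXcom.example/yy/ https://medium.com/@yyyyy"
)

# Hypothetical helper: wrap literal dots in a character class so that "."
# matches only a dot rather than any character.
escape_dots <- function(x) gsub(".", "[.]", x, fixed = TRUE)

domains <- c(linkedin = "linkedin.com", medium = "medium.com")
patterns <- setNames(
  paste0("http[^ ]*", escape_dots(domains), "[^ ]*"),
  names(domains)
)

# str_extract() returns the first match per element, or NA if there is none.
res <- lapply(patterns, function(p) str_extract(urls, p))
as.data.frame(res, stringsAsFactors = FALSE)
```

With the dot escaped, the look-alike host in the second row is not picked up, so the linkedin column is NA for that row. This is also a difference from the base R version: str_extract() yields exactly one value per row, so the result always has the same number of rows as the input.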