I have a dataset with a column 'urls' in which each value is a single string containing several URLs:
urls <- "https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx"
id <- 1
df <- cbind(data.frame(urls), data.frame(id))
I now want to extract the complete URL whose domain matches "linkedin.com" and store it in a new column df$linkedin, and likewise extract the URL matching "medium.com" into a new column df$medium. So the result would basically be:
df$linkedin
"https://www.linkedin.com/xx/xxx-xx-xxx/"
df$medium
"https://medium.com/@xxxxx"
Somehow I'm having a bad hair day today and can't see an elegant solution. It would be awesome if you could help me out here :)
CodePudding user response:
I'll make this a little more interesting by making it two rows:
df2 <- structure(list(urls = c("https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx", "https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy"), id = c(1, 1)), row.names = c(NA, -2L), class = "data.frame")
df2
# urls id
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx 1
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy 1
base R
baseurls <- c("linkedin", "medium")
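# For each base name, pull the space-delimited token that starts with "http" and contains it.
# Note: unlist() only lines up with the rows if every row has exactly one match per base name.
newcols <- lapply(setNames(nm = baseurls), function(U) {
  unlist(regmatches(df2$urls, gregexpr(paste0("http[^ ]*", U, "[^ ]*"), df2$urls)))
})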
newcols
# $linkedin
# [1] "https://www.linkedin.com/xx/xxx-xx-xxx/" "https://www.linkedin.com/yy/yyy-yy-yyy/"
# $medium
# [1] "https://medium.com/@xxxxx" "https://medium.com/@yyyyy"
cbind(df2, data.frame(newcols))
# urls id linkedin medium
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https://medium.com/@xxxxx
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy 1 https://www.linkedin.com/yy/yyy-yy-yyy/ https://medium.com/@yyyyy
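The base R version above relies on every row containing exactly one URL per base name, because unlist() flattens whatever gregexpr() finds. If that can't be guaranteed, a possible variant is to take only the first match per row and fill in NA where there is none. This is just a sketch: the helper name extract_first and the three-row sample vector are made up for illustration and are not part of the question's data.

```r
# Hypothetical input: row 2 has no medium URL, row 3 has no linkedin URL.
urls_demo <- c(
  "https://www.linkedin.com/xx/ https://medium.com/@xxxxx",
  "https://www.linkedin.com/yy/",
  "https://medium.com/@zzzzz"
)
baseurls <- c("linkedin", "medium")

# Hypothetical helper: first match per element, NA when nothing matches,
# so the result always has the same length as the input.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)            # -1 where there is no match
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > -1L
  out[hit] <- regmatches(x, m)        # regmatches() returns matched elements only
  out
}

newcols <- lapply(setNames(nm = baseurls), function(U) {
  extract_first(urls_demo, paste0("http[^ ]*", U, "[^ ]*"))
})
res <- data.frame(newcols, stringsAsFactors = FALSE)
res$medium
# [1] "https://medium.com/@xxxxx" NA "https://medium.com/@zzzzz"
```

The result can be attached with cbind() in the same way as before, since each new column now has one entry per row.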
tidyverse
## baseurls <- ...
library(dplyr)
library(stringr) # str_extract
library(purrr) # map_dfc
map_dfc(setNames(nm = baseurls), ~ str_extract(df2$urls, paste0("http[^ ]*", .x, "[^ ]*"))) %>%
bind_cols(df2, .)
# urls id linkedin medium
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https://medium.com/@xxxxx
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy 1 https://www.linkedin.com/yy/yyy-yy-yyy/ https://medium.com/@yyyyy
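One caveat that applies to both approaches: the pattern "http[^ ]*linkedin[^ ]*" accepts any token that merely contains "linkedin", so something like "https://notlinkedin.com/x" would match too. If that matters, a stricter pattern can require the host to be the domain itself or a subdomain of it. The following is a rough base R sketch under that assumption; strict_pattern and extract_strict are hypothetical names, the sample strings are invented, and hosts with an explicit port (e.g. ":8080") are deliberately not handled.

```r
# Hypothetical helper: build a pattern where the host must be `domain`
# or end in ".domain". (?![^/ ]) is a PCRE lookahead meaning "the host
# ends here" (next character is "/", a space, or end of string).
strict_pattern <- function(domain) {
  literal <- gsub(".", "[.]", domain, fixed = TRUE)  # match dots literally
  paste0("https?://([^/ ]*[.])?", literal, "(?![^/ ])[^ ]*")
}

# Hypothetical helper: first strict match per element, NA otherwise.
extract_strict <- function(x, domain) {
  m <- regexpr(strict_pattern(domain), x, perl = TRUE)  # lookahead needs perl = TRUE
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > -1L
  out[hit] <- regmatches(x, m)
  out
}

# Invented sample data, not from the question.
x <- c(
  "https://www.linkedin.com/xx/ https://medium.com/@a",
  "https://notlinkedin.com/x https://linkedin.com.evil.io/y",
  "https://linkedin.com"
)
extract_strict(x, "linkedin.com")
# [1] "https://www.linkedin.com/xx/" NA "https://linkedin.com"
```

The trade-off is that the full domain ("linkedin.com") has to be supplied instead of just the base name, and perl = TRUE is required for the lookahead.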