a dataframe as list:
dfcheck <- data.frame(status = c("open/close", "close", "open"), stock = c("company energy","goods and books","other"), name = c("amazon1;google1","google3;yahoo1","yahoo2;amazon2;google2"))
And an input dataframe like this:
dfdata <- data.frame(id = c("id1", "id2", "id3"), title1 = c("amazon1","google1","yahoo1"), title2 = c("yahoo2",NA,"amazon2"))
How is it possible to produce a dataframe with columns based the previous list:
Expected output: dfdata <- data.frame(id = c("id1", "id2", "id3"), title1 = c("amazon1","google1","yahoo1"), title2 = c("yahoo2",NA,"amazon2"), status1 = c("open/close","open/close","close"), stock1 = c("company energy","company energy","goods and books"), status2 = c("open",NA,"open"), stock2 = c("other",NA,"other"))
id title1 title2 status1 stock1 status2 1 id1 amazon1 yahoo2 open/close company energy open 2 id2 google1 <NA> open/close company energy <NA> 3 id3 yahoo1 amazon2 close goods and books open stock2 1 other 2 <NA> 3 other
This dataframe checks in dfdata in every column, expect the first id column, if any of the values in dfcheck dataframe exist and creates two new columns with the status and stock of dfcheck. From the dfcheck the column name has more than one values separated by ";"
CodePudding user response:
Libraries:
library(dplyr)
library(stringr)
library(tidyr)
First you'd need to tidy your dfcheck
data.frame:
dfcheck_tidy <- dfcheck %>%
mutate(name = str_split(name, ";")) %>%
unnest(name)
(I'm not using tidyr::separate
to do this as it seems from your example that you can have a variable length of names separated by ";".)
And now you can perform the two joins:
dfdata %>%
left_join(dfcheck_tidy,
by = c("title1" = "name")) %>%
left_join(dfcheck_tidy,
by = c("title2" = "name"),
suffix = c("1", "2"))
# id title1 title2 status1 stock1 status2 stock2
# 1 id1 amazon1 yahoo2 open/close company energy open other
# 2 id2 google1 <NA> open/close company energy <NA> <NA>
# 3 id3 yahoo1 amazon2 close goods and books open other
CodePudding user response:
Here's an approach using regex_join()
from the fuzzyjoin
package.
library(dplyr)
library(fuzzyjoin)
regex_right_join(dfcheck, dfdata, by = c(name = "title1")) %>%
regex_right_join(dfcheck, ., by = c(name = "title2")) %>%
select(!contains("name")) %>%
relocate(id, title1, title2)
id title1 title2 status.x stock.x status.y stock.y
1 id1 amazon1 yahoo2 open other open/close company energy
2 id2 google1 <NA> <NA> <NA> open/close company energy
3 id3 yahoo1 amazon2 open other close goods and books