I am trying to extract texts based on a match in a character column of a dataframe with a column of another dataframe. Here is an example of reproducible dataframes.
productlist <- data.frame(prod_tg=c('Milk', 'Soybean', 'Pig meat'),
nomencl=c('milk|SMP|dairy|MK', 'Soybean|Soyabean', 'Pigmeat|PK|Pork|pigmeat') )
tctdf <- data.frame(policy_label=c('Market Milk', 'dairy products', 'OCHA - MK', 'pig meat', 'Soybeans'))
I would like to match the strings case insensitive. In the productlist, I have included all entries in nomencl column by using '|' so that any match of these will go specific entry of prod_tg such as Milk, Pig meat, Soybean.
my expected dataframe would look like as:
finaldf = data.frame(policy_label=c('Market Milk', 'dairy products', 'OCHA - MK', 'pig meat', 'Soybeans'), prod_match=c('milk', 'dairy', 'MK','pig', 'Soybean'), product_tag=c('Milk', 'Milk', 'Milk', 'Pig meat', 'Soybean'))
I have been thinking of grepl
function in base R but open to any other function. Grateful for your suggestions.
CodePudding user response:
Here's a way using stringr::str_extract
library(stringr)
cbind(tctdf,t(sapply(tctdf$policy_label, function(x) {
v <- str_extract(x, regex(productlist$nomencl, ignore_case = TRUE))
c(prod_match = toString(na.omit(v)),
product_tag = toString(productlist$prod_tg[!is.na(v)]))
}))) |> `rownames<-`(NULL)
# policy_label prod_match product_tag
#1 Market Milk Milk Milk
#2 dairy products dairy Milk
#3 OCHA - MK MK Milk
#4 pigmeat pigmeat Pig meat
#5 Soybeans Soybean Soybean
data
Changed <=
to <-
for tctdf
and replaced 'pig meat'
to 'pigmeat'
so that it actually matches with productlist
.
tctdf <- data.frame(policy_label=c('Market Milk', 'dairy products',
'OCHA - MK', 'pigmeat', 'Soybeans'))