Home > other >  R string match between two dataframe columns
R string match between two dataframe columns

Time:01-31

I am trying to extract texts based on a match in a character column of a dataframe with a column of another dataframe. Here is an example of reproducible dataframes.

productlist <- data.frame(prod_tg=c('Milk', 'Soybean', 'Pig meat'), 
                          nomencl=c('milk|SMP|dairy|MK', 'Soybean|Soyabean', 'Pigmeat|PK|Pork|pigmeat') )

tctdf <- data.frame(policy_label=c('Market Milk', 'dairy products', 'OCHA - MK', 'pig meat', 'Soybeans'))

I would like to match the strings case insensitive. In the productlist, I have included all entries in nomencl column by using '|' so that any match of these will go specific entry of prod_tg such as Milk, Pig meat, Soybean.

my expected dataframe would look like as:

finaldf = data.frame(policy_label=c('Market Milk', 'dairy products', 'OCHA - MK', 'pig meat', 'Soybeans'), prod_match=c('milk', 'dairy', 'MK','pig', 'Soybean'), product_tag=c('Milk', 'Milk', 'Milk', 'Pig meat', 'Soybean'))

I have been thinking of grepl function in base R but open to any other function. Grateful for your suggestions.

CodePudding user response:

Here's a way using stringr::str_extract

library(stringr)

cbind(tctdf,t(sapply(tctdf$policy_label, function(x) {
  v <- str_extract(x, regex(productlist$nomencl, ignore_case = TRUE))
  c(prod_match = toString(na.omit(v)), 
    product_tag = toString(productlist$prod_tg[!is.na(v)]))
}))) |> `rownames<-`(NULL)

#    policy_label prod_match product_tag
#1    Market Milk       Milk        Milk
#2 dairy products      dairy        Milk
#3      OCHA - MK         MK        Milk
#4        pigmeat    pigmeat    Pig meat
#5       Soybeans    Soybean     Soybean

data

Changed <= to <- for tctdf and replaced 'pig meat' to 'pigmeat' so that it actually matches with productlist.

tctdf <- data.frame(policy_label=c('Market Milk', 'dairy products', 
                                   'OCHA - MK', 'pigmeat', 'Soybeans'))
  •  Tags:  
  • Related