R - Identify words in a comma-separated list for a specific column in a dataframe

Time:11-14

I have a column in a dataframe where each cell holds a list of comma-separated words without spaces after the commas. I want to detect the presence of either of two specific words in each cell, and create a new column that is populated with 'yes' when one of them is detected and is left blank otherwise.

So for instance, the two words that I want to detect are 'test word 1' and 'test word 2' (note the presence of spaces within each of those two test words). Each cell in the column will be of the form 'word1,word2,...' (note the absence of spaces after the commas), and may or may not contain the two test words.

x_test <- c('test word 1','test word 2') 

I have tried a couple of methods, but all of them seem either to fail to detect the test words because the whole cell is being interpreted as a single large word,

A <- strsplit(current_cell, split=",")
B <- c(unlist(strsplit(A, split=",")) 
C <- lapply( df['col_name'],  B)

or alternatively the method seems to strip apart the whole column and create a single giant list out of it.

C <- gsub(",([A-Za-z])", ", \\1", df$col_name)

I also tried the intersection of two vectors, but that did not work either.

Whatever method is used to obtain C, we then want to ask something like

df$test <- ifelse( C %in% x_test , 'yes', '')

How would I achieve all this in R without running into the above-mentioned issues?

(Here is an example of the specific column that I would like to operate on. In this case, the two expressions that I am targeting are 'non-essential businesses' and 'all businesses'):

entertainment,recreation,offices,tourism
entertainment,venues,non-essential businesses
hospitality,entertainment,fitness,beauty
all businesses
hospitality,entertainment,fitness
entertainment,religion,non-essential businesses

CodePudding user response:

Let's assume you have this data:

df <- data.frame(str = c("entertainment,recreation,offices,tourism",
"entertainment,venues,non-essential businesses",
"hospitality,entertainment,fitness,beauty",
"all businesses",
"hospitality,entertainment,fitness",
"entertainment,religion,non-essential businesses"))

and these test words:

test_words <- c("non-essential businesses", "all businesses")

Then you can convert the test words into an alternation pattern:

test_words_patt <- paste0(test_words, collapse = "|")

and pass that pattern to grepl to detect the presence of any of the test words in str, using ifelse to populate the new column with "yes" or "":

library(dplyr)
df %>%
  mutate(test_word_is_present = ifelse(grepl(test_words_patt, str),
                                       "yes", 
                                       ""))
                                              str test_word_is_present
1        entertainment,recreation,offices,tourism                     
2   entertainment,venues,non-essential businesses                  yes
3        hospitality,entertainment,fitness,beauty                     
4                                  all businesses                  yes
5               hospitality,entertainment,fitness                     
6 entertainment,religion,non-essential businesses                  yes

In base R:

df$test_word_is_present <- ifelse(grepl(test_words_patt, df$str),
                                        "yes", 
                                        "")
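One caveat with the pattern approach: grepl matches substrings, so a cell containing, say, "small businesses" would also be flagged, because it contains "all businesses". If whole-item matches are needed, a minimal sketch (assuming the items are separated by plain commas with no surrounding spaces, as in the example data) is to split each cell and compare the pieces with %in%. The names items and has_test_word are just illustrative:

```r
# Example data mirroring the column from the question.
df <- data.frame(str = c("entertainment,recreation,offices,tourism",
                         "entertainment,venues,non-essential businesses",
                         "hospitality,entertainment,fitness,beauty",
                         "all businesses",
                         "hospitality,entertainment,fitness",
                         "entertainment,religion,non-essential businesses"))
test_words <- c("non-essential businesses", "all businesses")

# Split every cell on literal commas; strsplit() returns a list with one
# character vector per row. as.character() guards against factor columns.
items <- strsplit(as.character(df$str), split = ",", fixed = TRUE)

# For each row, TRUE if at least one item equals a test word exactly.
has_test_word <- vapply(items, function(x) any(x %in% test_words), logical(1))

df$test_word_is_present <- ifelse(has_test_word, "yes", "")
```

This keeps the row structure intact because vapply returns exactly one logical value per cell, which avoids the "one giant list" problem described in the question.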