Home > Enterprise >  Detecting key words across multiple columns and flagging them each in new columns
Detecting key words across multiple columns and flagging them each in new columns

Time:09-16

I have this type of data:

set.seed(123)
df <- data.frame(
  v1 = sample(LETTERS[1:10], 5),
  v2 = sample(LETTERS[1:10], 5),
  v3 = sample(LETTERS[1:10], 5),
  v4 = sample(LETTERS[1:10], 5)      
)

as well as a number of key words:

keys <- c("A", "C", "F", "H")

I want to (i) detect which keyword is present in each column of df and (ii) create new columns for each of the keys, recording with 1if that key is present. I've been using a for loop so far, which, however, does not work:

library(stringr)
for(i in keys){
  df[i] <-  str_detect(apply(df, 1, paste0, collapse = " "), keys[i])
}
df
  v1 v2 v3 v4  A  C  F  H
1  C  A  J  I NA NA NA NA
2  H  E  E  C NA NA NA NA
3  D  H  F  A NA NA NA NA
4  G  D  I  J NA NA NA NA
5  F  C  A  F NA NA NA NA

I don't know why it doesn't work, as the same logic applied to a single key does work:

 str_detect(apply(df, 1, paste0, collapse = " "), keys[1])
[1] 1 0 1 0 1

I'd be grateful for advice on how the for loop must be tweaked but also for an entirely different solution to the problem (e.g., dplyr).

CodePudding user response:

A base solution:

cbind(df,  sapply(keys, grepl, x = do.call(paste, df)))

#   v1 v2 v3 v4 A C F H
# 1  C  E  J  I 0 1 0 0
# 2  J  D  E  J 0 0 0 0
# 3  B  F  C  E 0 1 1 0
# 4  H  H  H  C 0 1 0 1
# 5  F  A  A  B 1 0 1 0

CodePudding user response:

In your for loop you are indexing on both the index and the name. Test keys["A"], it'll return NA. You can use which to avoid this:

library(stringr)
for(i in keys){
  df[i] <- ifelse(str_detect(apply(df, 1, paste0, collapse = " "), keys[which(keys == i)]), 1, 0)
}

output

  v1 v2 v3 v4 A C F H
1  C  E  J  I 0 1 0 0
2  J  D  E  J 0 0 0 0
3  B  F  C  E 0 1 1 0
4  H  H  H  C 0 1 0 1
5  F  A  A  B 1 0 1 0

Another option would be to index on the position (1, 2, 3, 4):

for(i in seq_along(keys)){
  df[keys[i]] <- ifelse(str_detect(apply(df, 1, paste0, collapse = " "), keys[i]), 1, 0)
}

CodePudding user response:

Using dplyr's if_any:

library(stringr)
library(dplyr)
library(purrr)
bind_cols(df,
          map_dfc(keys, \(x) df %>% 
                    transmute(!!(x) :=  (if_any(v1:v4, \(y) str_detect(x, y))))))

output

  v1 v2 v3 v4 A C F H
1  C  E  J  I 0 1 0 0
2  J  D  E  J 0 0 0 0
3  B  F  C  E 0 1 1 0
4  H  H  H  C 0 1 0 1
5  F  A  A  B 1 0 1 0
  • Related