Assign value to a new column based on any of the multiple patterns from a vector

I have the following dataset called df:

structure(list(col1 = c("a b", "d e", "g f", "h j", "j k", "y z", 
"e f", "b c", "f g", "c d", "y z", "t u")), class = "data.frame", row.names = c(NA, 
-12L))

For this dataset, I have two vectors of matches: matching1 <- c("a b", "b c", "c d") and matching2 <- c("c d","e f","f g"). In my df, I would like to create a new column and assign a value depending on the match. For strings in matching1, I would like to assign a value of 1; for strings in matching2, a value of 2; and for every string not matched, a value of 3. Ideally, the assignment for matching2 would not overwrite the earlier assignment, because matching1 and matching2 both contain the string "c d". I know I can use:

matches1 <- paste0(na.omit(matching1), "", collapse = "|")

to create a single pattern with the alternatives joined by "|" (or), and I have tried to combine it with case_when. However, case_when only allows single patterns, and the list of potential matches in my original dataset is very long, so I would like to avoid spelling out every condition explicitly.

The output should look like this:

structure(list(col1 = c("a b", "d e", "g f", "h j", "j k", "y z", 
"e f", "b c", "f g", "c d", "y z", "t u"), col2 = c("1", "2", 
"3", "3", "3", "3", "2", "1", "2", "1", "3", "3")), class = "data.frame", row.names = c(NA, 
-12L))

CodePudding user response：

I think this does it:

Edit: the matching2 assignment is now performed first, to handle the case where "c d" is in both vectors and matching1 is preferred.

df$ans<-ifelse(df$col1 %in% matching2, 2, 3)
df$ans<-ifelse(df$col1 %in% matching1, 1, df$ans)

Or the pre-edit version, with langtang's comment applied:

df$ans<-ifelse(df$col1 %in% matching1, 1, 3)
df$ans<-ifelse(df$col1 %in% setdiff(matching2, matching1), 2, df$ans)
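
As a quick check, the two versions above can be run side by side on the example data. This is only a sketch in base R; the names v1 and v2 are introduced here for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question
df <- data.frame(col1 = c("a b", "d e", "g f", "h j", "j k", "y z",
                          "e f", "b c", "f g", "c d", "y z", "t u"))
matching1 <- c("a b", "b c", "c d")
matching2 <- c("c d", "e f", "f g")

# Version 1: assign 2 first, then let matching1 overwrite with 1
v1 <- ifelse(df$col1 %in% matching2, 2, 3)
v1 <- ifelse(df$col1 %in% matching1, 1, v1)

# Version 2: assign 1 first, then assign 2 only for strings unique to matching2
v2 <- ifelse(df$col1 %in% matching1, 1, 3)
v2 <- ifelse(df$col1 %in% setdiff(matching2, matching1), 2, v2)

identical(v1, v2)        # TRUE: both orderings give the same column
v1[df$col1 == "c d"]     # 1: the shared string keeps the matching1 value
```

Both orderings give matching1 priority for the shared string "c d"; they differ only in where that priority is enforced.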

CodePudding user response：

Here is an option using data.table, with a merge:

library(data.table)

rbind(
  data.table(col1=matching1, col2=1),
  data.table(col1=setdiff(matching2,matching1), col2=2)
)[setDT(df), on="col1"][is.na(col2), col2:=3][]

Output:

      col1  col2
    <char> <num>
 1:    a b     1
 2:    d e     3
 3:    g f     3
 4:    h j     3
 5:    j k     3
 6:    y z     3
 7:    e f     2
 8:    b c     1
 9:    f g     2
10:    c d     1
11:    y z     3
12:    t u     3
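
The same lookup-table idea might also be written in base R, which may help when there are many matching vectors rather than two. The sketch below assumes the vectors are collected in a named list (called groups here, a hypothetical name) whose names are the values to assign, listed in priority order.

```r
# Example data from the question
df <- data.frame(col1 = c("a b", "d e", "g f", "h j", "j k", "y z",
                          "e f", "b c", "f g", "c d", "y z", "t u"))
matching1 <- c("a b", "b c", "c d")
matching2 <- c("c d", "e f", "f g")

# Hypothetical named list: names are the values to assign, earlier entries win
groups <- list("1" = matching1, "2" = matching2)

# Long-format lookup table: column `values` holds the strings,
# column `ind` holds the list name each string came from
lookup <- stack(groups)

# Keep only the first occurrence of each string, so "c d" stays with group 1
lookup <- lookup[!duplicated(lookup$values), ]

# Look up every row; strings without a match come back as NA
idx <- match(df$col1, lookup$values)
df$col2 <- as.numeric(as.character(lookup$ind[idx]))

# Default value for unmatched strings
df$col2[is.na(df$col2)] <- 3
```

Adding a third vector would then only require another entry in the list, not another condition.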

CodePudding user response：

Two ways to solve your problem:

library(dplyr)

df %>% 
  mutate(col2 = case_when(col1 %in% matching1 ~ 1, 
                          col1 %in% matching2 ~ 2, 
                          TRUE ~ 3))

   col1 col2
1   a b    1
2   d e    3
3   g f    3
4   h j    3
5   j k    3
6   y z    3
7   e f    2
8   b c    1
9   f g    2
10  c d    1
11  y z    3
12  t u    3

Or

library(data.table)

setDT(df)[, col2 := fcase(col1 %in% matching1, 1, col1 %in% matching2, 2, default=3)]

      col1  col2
    <char> <num>
 1:    a b     1
 2:    d e     3
 3:    g f     3
 4:    h j     3
 5:    j k     3
 6:    y z     3
 7:    e f     2
 8:    b c     1
 9:    f g     2
10:    c d     1
11:    y z     3
12:    t u     3
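
If the vectors are meant as regular-expression patterns rather than exact strings, as the paste0(..., collapse = "|") attempt in the question suggests, the collapsed pattern can be passed to grepl instead of using %in%. The following is a minimal base R sketch under the assumption that each entry must match the whole cell and contains no regex metacharacters; entries with characters such as "." or "(" would need escaping first.

```r
# Example data from the question
df <- data.frame(col1 = c("a b", "d e", "g f", "h j", "j k", "y z",
                          "e f", "b c", "f g", "c d", "y z", "t u"))
matching1 <- c("a b", "b c", "c d")
matching2 <- c("c d", "e f", "f g")

# Collapse each vector into one alternation; ^( ... )$ anchors it so that
# an entry only matches a complete cell, not a substring of a longer one
pat1 <- paste0("^(", paste(matching1, collapse = "|"), ")$")
pat2 <- paste0("^(", paste(matching2, collapse = "|"), ")$")

# matching1 is tested first, so it takes priority for shared strings
df$col2 <- ifelse(grepl(pat1, df$col1), 1,
           ifelse(grepl(pat2, df$col1), 2, 3))
```

The same grepl(pat1, col1) and grepl(pat2, col1) conditions can be used as the left-hand sides in case_when or as the conditions in fcase.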