Home > Enterprise >  checking whether a word from a vector appears in the same row in different columns of a data frame
checking whether a word from a vector appears in the same row in different columns of a data frame

Time:12-25

I am trying to troubleshoot my data, and check whether a certain name appears in two different columns in the same row (same observation):

df1 <- data.frame(
  text1 = c("John Jay Jakson",
            "John Jay Jakson",
            "John Jay Jakson",
            "John Jack Jakson"), 
  text2 = c("Jerry Jack Jameson", 
            "Jerry Jack Jameson", 
            "Jerry Jack Jameson", 
            "Jerry Jack Jameson"))

df2 <- data.frame(
  names = c("John", "Jay", "Jackson", "Jerry", "Jack", "Jameson"))

The code I've come up with is the following

data.check = sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE)

or alternatively:

which(sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE))

But these are not the best ways to sift through the data. Instead, I want to have a new column in df1, df1$check, that holds a 1/0 based on whether each row has the same name in that row under df1$text1 and df1$text2.

I know that assigning this code to a new column won't work:

df1$check = sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE)

it gives me false for the 4th row, which should be a true.

Any help appreciated, thank you.

CodePudding user response:

I include a dplyr approach:

# import required libraries
library(dplyr)
library(stringr)

# create your data (I added two more rows)
df1 <- data.frame(
  text1 = c("John Jay Jakson",
            "John Jay Jakson",
            "John Jay Jakson",
            "John Jack Jakson","Peter","John Snow"),
  text2 = c("Jerry Jack Jameson", 
            "Jerry Jack Jameson", 
            "Jerry Jack Jameson", 
            "Jerry Jack Jameson","Peter", "Clay Snow"))

df2 <- data.frame(
  names = c("John", "Jay", "Jackson", "Jerry", "Jack", "Jameson"))

# optionally convert df2 to vector or list
v2<-as.vector(df2$names)

#use of str_detect() to look for the string
# use of case_when() that works like if/else
# by including the | operator between the different names
# create a new column called check to store 1s and 0s  

df1<-df1%>%
  mutate(check=case_when(str_detect(text1,paste(v2, collapse = "|"))==TRUE & str_detect(text2,paste(v2, collapse = "|"))==TRUE ~"1",
                         TRUE~"0"))

CodePudding user response:

The output from sapply in the OP's code returns a logical matrix.

> sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2)))
      John   Jay Jackson Jerry  Jack Jameson
[1,] FALSE FALSE   FALSE FALSE FALSE   FALSE
[2,] FALSE FALSE   FALSE FALSE FALSE   FALSE
[3,] FALSE FALSE   FALSE FALSE FALSE   FALSE
[4,] FALSE FALSE   FALSE FALSE  TRUE   FALSE

Each column of the matrix should be converged to a single logical value to create a vector. We may wrap with rowSums on the logical matrix and then convert the row wise sum to logical vector (> 0) and coerce it back to binary ( - TRUE -> 1, FALSE -> 0)

df1$check <-  (rowSums(sapply(df2$names, function(x) 
     (grepl(x, df1$text1) & grepl(x, df1$text2)))) > 0)
df1$check
[1] 0 0 0 1

Or another option is to loop with lapply, return a list and use Reduce with | to return a vector

df1$check <-  (Reduce(`|`, lapply(df2$names, function(x)
      (grepl(x, df1$text1) & grepl(x, df1$text2)))))
  • Related