I am trying to troubleshoot my data, and check whether a certain name appears in two different columns in the same row (same observation):
df1 <- data.frame(
text1 = c("John Jay Jakson",
"John Jay Jakson",
"John Jay Jakson",
"John Jack Jakson"),
text2 = c("Jerry Jack Jameson",
"Jerry Jack Jameson",
"Jerry Jack Jameson",
"Jerry Jack Jameson"))
df2 <- data.frame(
names = c("John", "Jay", "Jackson", "Jerry", "Jack", "Jameson"))
The code I've come up with is the following
data.check = sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE)
or alternatively:
which(sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE))
But these are not the best ways to sift through the data. Instead, I want to have a new column in df1, df1$check, that holds a 1/0 based on whether each row has the same name in that row under df1$text1 and df1$text2.
I know that assigning this code to a new column won't work:
df1$check = sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE)
it gives me false for the 4th row, which should be a true.
Any help appreciated, thank you.
CodePudding user response:
I include a dplyr approach:
# import required libraries
library(dplyr)
library(stringr)
# create your data (I added two more rows)
df1 <- data.frame(
text1 = c("John Jay Jakson",
"John Jay Jakson",
"John Jay Jakson",
"John Jack Jakson","Peter","John Snow"),
text2 = c("Jerry Jack Jameson",
"Jerry Jack Jameson",
"Jerry Jack Jameson",
"Jerry Jack Jameson","Peter", "Clay Snow"))
df2 <- data.frame(
names = c("John", "Jay", "Jackson", "Jerry", "Jack", "Jameson"))
# optionally convert df2 to vector or list
v2<-as.vector(df2$names)
#use of str_detect() to look for the string
# use of case_when() that works like if/else
# by including the | operator between the different names
# create a new column called check to store 1s and 0s
df1<-df1%>%
mutate(check=case_when(str_detect(text1,paste(v2, collapse = "|"))==TRUE & str_detect(text2,paste(v2, collapse = "|"))==TRUE ~"1",
TRUE~"0"))
CodePudding user response:
The output from sapply
in the OP's code returns a logical matrix
.
> sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2)))
John Jay Jackson Jerry Jack Jameson
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE TRUE FALSE
Each column of the matrix
should be converged to a single logical value to create a vector. We may wrap with rowSums
on the logical matrix and then convert the row wise sum to logical vector (> 0
) and coerce it back to binary (
- TRUE
-> 1, FALSE
-> 0)
df1$check <- (rowSums(sapply(df2$names, function(x)
(grepl(x, df1$text1) & grepl(x, df1$text2)))) > 0)
df1$check
[1] 0 0 0 1
Or another option is to loop with lapply
, return a list
and use Reduce
with |
to return a vector
df1$check <- (Reduce(`|`, lapply(df2$names, function(x)
(grepl(x, df1$text1) & grepl(x, df1$text2)))))