I have a data frame (df1) with one column, with each entry/row/observation consisting of a long string of text (df1$text). In a separate data frame (df2) I have one column, with each entry/row/observation consisting of a single name (df2$name).
For each row in df1, I would like to note which of the names in df2$name appear in the text. Ideally, I'd like to store whether each name appears in df1$text as a 1/0 value in a new column of df1 named after that name (i.e. dummy variables):
> df1
text
1 ...
2 ...
3 ...
4 ...
> df2
name
1 John
2 James
3 Jerry
4 Jackson
After code is executed:
> df1
text John James Jerry Jackson
1 ... 1 1 0 1
2 ... 0 0 0 1
3 ... 1 1 0 1
4 ... 1 0 0 1
Is there a way to do this without using a for loop? My text fields are long and I have many observations in both df1 and df2.
CodePudding user response:
You did not provide a reproducible example, so I made a dummy df1 myself:
df1 <- data.frame(
  text = c("John James John Jakson",
           "Jackson abcd zxcv",
           "John Jackson James Jerr aa",
           "John Jackson JAJAJAJA")
)
text
1 John James John Jakson
2 Jackson abcd zxcv
3 John Jackson James Jerr aa
4 John Jackson JAJAJAJA
Then you can use dplyr, with one grepl call per name:
library(dplyr)
df1 %>%
  mutate(John = as.numeric(grepl("John", text)),
         James = as.numeric(grepl("James", text)),
         Jerry = as.numeric(grepl("Jerry", text)),
         Jackson = as.numeric(grepl("Jackson", text))
  )
text John James Jerry Jackson
1 John James John Jakson 1 1 0 0
2 Jackson abcd zxcv 0 0 0 1
3 John Jackson James Jerr aa 1 1 0 1
4 John Jackson JAJAJAJA 1 0 0 1
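The mutate() call above spells out every name by hand, which does not scale when df2 has many rows. A minimal sketch of one way around that, assuming the names sit in df2$name as in the question and should be matched as literal substrings: sapply builds a 0/1 matrix with one column per name, and cbind attaches it to df1. The variable names hits and result are only illustrative.

```r
# Dummy data taken from the example above
df1 <- data.frame(
  text = c("John James John Jakson",
           "Jackson abcd zxcv",
           "John Jackson James Jerr aa",
           "John Jackson JAJAJAJA"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(name = c("John", "James", "Jerry", "Jackson"),
                  stringsAsFactors = FALSE)

# One column per name; sapply names the columns after the character input.
# fixed = TRUE treats each name as a literal string, not a regular expression.
hits <- sapply(df2$name, function(nm) {
  as.integer(grepl(nm, df1$text, fixed = TRUE))
})

# Attach the 0/1 columns to the original data frame
result <- cbind(df1, hits)
result
```

grepl is already vectorised over df1$text, so the only iteration left is over the names.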
CodePudding user response:
A base R option using lapply:
df1[df2$name] <- lapply(df2$name, function(x) as.integer(grepl(x, df1$text)))
If you want the match to be case-insensitive, add ignore.case = TRUE to the grepl call.
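As a rough sketch of the case-insensitive variant: the data below is made up for illustration (it is not the asker's data), and the \\b word-boundary anchors are an extra assumption on my part, added so that "John" does not also match "Johnson". It also assumes the names contain no regex metacharacters.

```r
# Hypothetical data: mixed case plus a longer name that contains "John"
df1 <- data.frame(
  text = c("John James John Jakson",
           "Jackson abcd zxcv",
           "john Johnson JAMES",
           "Jerry"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(name = c("John", "James", "Jerry", "Jackson"),
                  stringsAsFactors = FALSE)

# \\b anchors restrict the match to whole words;
# ignore.case = TRUE lets "john" and "JAMES" count as matches.
df1[df2$name] <- lapply(df2$name, function(x) {
  pattern <- paste0("\\b", x, "\\b")
  as.integer(grepl(pattern, df1$text, ignore.case = TRUE, perl = TRUE))
})
df1
```

Without the anchors, the third row would still get John = 1 here because of the standalone "john", but a row containing only "Johnson" would be flagged too.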