I have a list of names from different sources in one data set: one set is organized by FirstName LastName; the other has FullName. I want to see if the first name or the last name is within the full name column, and create a flag. Two questions:
First, I used this solution, but the resulting data doesn't have the right number of rows, and I'm not sure how to get it to make a flag. I tried to turn it into an ifelse statement, but got another error. How do I fix this so that if FirstName is in FullName, I flag True (or 1), and otherwise I flag False (or 0)?
Second, I have a few million names. Is this an efficient way to do things?
FirstName = c("mary", "paul", "mother", "john", "red", "little", "king")
LastName = c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur")
FullName = c("mary berry", "anthony horrowitz", "jennifer lawrence", "john jones", "red rover", "mick jagger", "king arthur")
df = data.frame(FirstName, LastName, FullName)
#attempt 1 and error
df$match_firstname <- df[mapply(grepl, df$FirstName, df$FullName), ]
Error in `$<-.data.frame`(`*tmp*`, match_firstname, value = list(FirstName = c("mary", :
replacement has 4 rows, data has 7
#attempt 2 and error
df$match_firstname <- ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0)
Error in ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0) :
'list' object cannot be coerced to type 'logical'
Answer:
Instead, we could use str_detect, which is vectorized over both pattern and string, whereas the Map/mapply code loops over each row and thus could be less efficient.
library(dplyr)
library(stringr)
df %>%
filter(str_detect(FullName, FirstName))
-output
FirstName LastName FullName
1 mary berry mary berry
2 john jones john jones
3 red rover red rover
4 king arthur king arthur
If we want to add a new binary column instead of filtering, convert the logical to binary with as.integer (a unary + on the logical vector does the same):
df <- df %>%
  mutate(match_firstname = as.integer(str_detect(FullName, FirstName)))
-output
FirstName LastName FullName match_firstname
1 mary berry mary berry 1
2 paul hollywood anthony horrowitz 0
3 mother theresa jennifer lawrence 0
4 john jones john jones 1
5 red rover red rover 1
6 little tim mick jagger 0
7 king arthur king arthur 1
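The question also asks about the last name. A minimal sketch of the same idea extended to both columns follows; the column names match_lastname and match_either are made up for illustration, and the names are still treated as regular expressions, exactly as in the code above.

```r
library(stringr)

df <- data.frame(
  FirstName = c("mary", "paul", "mother", "john", "red", "little", "king"),
  LastName  = c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur"),
  FullName  = c("mary berry", "anthony horrowitz", "jennifer lawrence",
                "john jones", "red rover", "mick jagger", "king arthur"),
  stringsAsFactors = FALSE
)

# One logical per row: is the name found anywhere inside FullName?
first_hit <- str_detect(df$FullName, df$FirstName)
last_hit  <- str_detect(df$FullName, df$LastName)

# Store the flags as 1/0; match_either is 1 when either name is found
df$match_firstname <- as.integer(first_hit)
df$match_lastname  <- as.integer(last_hit)
df$match_either    <- as.integer(first_hit | last_hit)
```

Note that this is substring matching, so a short name such as "tim" would also be flagged inside a longer one such as "timothy".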
The error in the OP's code occurs because it assigns a subset of the data frame (only the matching rows) to a new column of the original data frame, so the lengths differ (4 rows versus 7):
df[mapply(grepl, df$FirstName, df$FullName), ]
FirstName LastName FullName
1 mary berry mary berry
4 john jones john jones
5 red rover red rover
7 king arthur king arthur
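To make the mismatch concrete, here is a small sketch that compares the two objects involved; the variable name hits is only for illustration.

```r
df <- data.frame(
  FirstName = c("mary", "paul", "mother", "john", "red", "little", "king"),
  LastName  = c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur"),
  FullName  = c("mary berry", "anthony horrowitz", "jennifer lawrence",
                "john jones", "red rover", "mick jagger", "king arthur"),
  stringsAsFactors = FALSE
)

# mapply() pairs each FirstName with the FullName on the same row
# and returns one TRUE/FALSE per row
hits <- mapply(grepl, df$FirstName, df$FullName)

length(hits)      # same length as nrow(df), so it can be a column
nrow(df[hits, ])  # the subset keeps only the matching rows
```

The logical vector has one element per row, which is what a new column needs; the subsetted data frame does not.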
Similar to the previous solution, assign the logical vector itself, converted to integer, rather than the subsetted rows:
df$match_firstname <- as.integer(mapply(grepl, df$FirstName, df$FullName))
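One caveat: grepl treats its pattern as a regular expression by default, so a name containing characters such as "." or "(" may match in unexpected ways. A hedged base R variant passes fixed = TRUE through MoreArgs to match the names literally; it may also be somewhat faster, though that has not been benchmarked here.

```r
df <- data.frame(
  FirstName = c("mary", "paul", "mother", "john", "red", "little", "king"),
  LastName  = c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur"),
  FullName  = c("mary berry", "anthony horrowitz", "jennifer lawrence",
                "john jones", "red rover", "mick jagger", "king arthur"),
  stringsAsFactors = FALSE
)

# fixed = TRUE makes grepl() compare plain text instead of a regex;
# USE.NAMES = FALSE stops mapply() from naming the result after FirstName
df$match_firstname <- as.integer(
  mapply(grepl, df$FirstName, df$FullName,
         MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
)

# Difference between the two modes on a pattern containing "."
regex_match   <- grepl("j.nes", "jones")                # TRUE: "." matches any character
literal_match <- grepl("j.nes", "jones", fixed = TRUE)  # FALSE: no literal "." in "jones"
```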