grepl for first column into last column: is this the most efficient


I have a list of names from different sources in one data set: one set is organized by FirstName LastName; the other has FullName. I want to see if the first name or the last name is within the full name column, and create a flag. Two questions:

First, I used this solution, but the result doesn't have the right number of rows, and I'm not sure how to turn it into a flag. I tried wrapping it in an ifelse statement, but got another error. How do I fix this so that if FirstName is in FullName I flag True (or 1), and otherwise False (or 0)?

Second, I have a few million names. Is this an efficient way to do things?

    FirstName = c("mary", "paul", "mother", "john", "red", "little", "king")
    LastName = c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur")
    FullName = c("mary berry", "anthony horrowitz", "jennifer lawrence", "john jones", "red rover", "mick jagger", "king arthur")
    df = data.frame(FirstName, LastName, FullName)

#attempt 1 and error
    df$match_firstname <- df[mapply(grepl, df$FirstName, df$FullName), ]

Error in `$<-.data.frame`(`*tmp*`, match_firstname, value = list(FirstName = c("mary",  : 
  replacement has 4 rows, data has 7
#attempt 2 and error
    df$match_firstname <- ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0)

Error in ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0) : 
  'list' object cannot be coerced to type 'logical'

CodePudding user response:

Instead, we could use str_detect from stringr, which is vectorized over both the pattern and the string. The Map/mapply code calls grepl once per row at the R level, so it could be less efficient.

library(dplyr)    # %>%, filter, mutate
library(stringr)  # str_detect
df %>% 
   filter(str_detect(FullName, FirstName))


  FirstName LastName    FullName
1      mary    berry  mary berry
2      john    jones  john jones
3       red    rover   red rover
4      king   arthur king arthur

If we want to add a new binary column instead of filtering, convert the logical to binary with as.integer or the unary +

df <- df %>%
    mutate(match_firstname = +(str_detect(FullName, FirstName)))


  FirstName  LastName          FullName match_firstname
1      mary     berry        mary berry               1
2      paul hollywood anthony horrowitz               0
3    mother   theresa jennifer lawrence               0
4      john     jones        john jones               1
5       red     rover         red rover               1
6    little       tim       mick jagger               0
7      king    arthur       king arthur               1
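
The as.integer conversion mentioned above gives the same 0/1 values as a unary plus applied to a logical vector. A minimal base-R sketch, where hit is only a hypothetical stand-in for the result of a str_detect call:

```r
# Stand-in for a logical match result (hypothetical values)
hit <- c(TRUE, FALSE, TRUE)

flag_a <- as.integer(hit)  # explicit conversion: TRUE -> 1, FALSE -> 0
flag_b <- +hit             # unary plus coerces logical to integer as well

flag_a                     # 1 0 1
identical(flag_a, flag_b)  # TRUE
```

as.integer is the more readable of the two; the unary plus is just shorter to type.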

The error in the OP's code occurs because a subset of the data (only the 4 matching rows) is being assigned to a new column of the original 7-row dataset, so the lengths differ

df[mapply(grepl, df$FirstName, df$FullName), ]
  FirstName LastName    FullName
1      mary    berry  mary berry
4      john    jones  john jones
5       red    rover   red rover
7      king   arthur king arthur

As in the previous solution, assign the logical vector itself rather than the subset, converting it with + (or as.integer)

df$match_firstname <- +(mapply(grepl, df$FirstName, df$FullName))
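
The question also asks about the last name, which the code above does not cover. Here is a sketch in base R, assuming the names should be matched as literal text rather than as regular expressions (hence fixed = TRUE). The helper in_full and the columns match_lastname and match_any are illustrative names, not part of the original code.

```r
FirstName <- c("mary", "paul", "mother", "john", "red", "little", "king")
LastName  <- c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur")
FullName  <- c("mary berry", "anthony horrowitz", "jennifer lawrence",
               "john jones", "red rover", "mick jagger", "king arthur")
df <- data.frame(FirstName, LastName, FullName, stringsAsFactors = FALSE)

# Row-wise literal substring test (hypothetical helper).
# fixed = TRUE turns off regex interpretation, so characters such as "."
# or "(" inside a name are matched as written.
in_full <- function(part, full) {
  unname(mapply(grepl, part, full, MoreArgs = list(fixed = TRUE)))
}

first_hit <- in_full(df$FirstName, df$FullName)
last_hit  <- in_full(df$LastName,  df$FullName)

df$match_firstname <- as.integer(first_hit)
df$match_lastname  <- as.integer(last_hit)
df$match_any       <- as.integer(first_hit | last_hit)  # first OR last name found
```

Note that this is substring matching, so a short name like "tim" would also be flagged inside "timothy". Whether that is acceptable depends on the data. For a few million rows, timing the alternatives on a sample of the real data is the reliable way to compare them.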