Filter rows based on presence of a string element from another column-CodePudding

I am trying to filter out relevant rowas based on the presence or existence of a string or part/element of a string in R. Following is the example:

colA                                      colb                           flag
New York Metropolitan Area                New York                       Yes 
New York Metropolitan Area                York                           Yes
New York Metropolitan Area                New York Area                  Yes
New York Metropolitan Area                Los Angeles                    No

Things I have tried till now:

Where 2 different dataframes are present

df1<- df1 %>% fuzzy_inner_join(df2, by = c("colA" = "colB"), match_fun = str_detect)

This option fails due to paranthesis and other special characters, cleaning them all up also did not help.

I joined the 2 dataframes based on an upper level hierarchay to limit the rows and created a dataframe df

df[, "lookup"] <- gsub(" ", "|", df[,"colB"])

df[,"flag"] <- mapply(grepl, df[,"lookup"], df[,"colA"])

Results not satisfactory as only limted rows are filtered.

Thank you in advance.

CodePudding user response：

Here is a base R solution.
The anonymous lambda function \(x, y) was introduced in R 4.1.0, for older versions of R use function(x, y).

pattern <- gsub(" ", "|", df1$colb)
i <- mapply(\(x, y)grepl(x, y), pattern, df1$colA)
df1$flag <- c("No", "Yes")[i   1L]

df1
#                        colA          colb flag
#1 New York Metropolitan Area      New York  Yes
#2 New York Metropolitan Area          York  Yes
#3 New York Metropolitan Area New York Area  Yes
#4 New York Metropolitan Area   Los Angeles   No

To remove the rows not matching the patterns:

df1[i, ]
#                        colA          colb flag
#1 New York Metropolitan Area      New York  Yes
#2 New York Metropolitan Area          York  Yes
#3 New York Metropolitan Area New York Area  Yes

Data

df1 <-
structure(list(colA = c("New York Metropolitan Area", 
"New York Metropolitan Area", "New York Metropolitan Area", 
"New York Metropolitan Area"), colb = c("New York", "York", 
"New York Area", "Los Angeles"), flag = c("Yes", "Yes", "Yes", 
"No")), row.names = c(NA, -4L), class = "data.frame")

CodePudding user response：

If I've understood your question correctly, you're trying to match partial strings and get a new column indicating the match:

df1 <- data.frame(colA = rep("New York Metropolitan Area ", 4),
                  colb = c("New York", "York", "New York Area", "Los Angeles") )

My first attempt was a simple str_detect but this tries to match the whole string in colb in colA:

df3 = df1%>%
  mutate(flag =  str_detect(colA, colb))

> df3
                         colA          colb  flag
1 New York Metropolitan Area       New York  TRUE
2 New York Metropolitan Area           York  TRUE
3 New York Metropolitan Area  New York Area FALSE
4 New York Metropolitan Area    Los Angeles FALSE

This is not quite right; though in this example you could just add df1$colb = gsub("Area", "", df1$colb ) first.

alternatively:

library(dplyr) # for pipe
library(stringr) # for str_detect
library(tidyr) # for separate

#separate colb into 3 columns (called b1,b2 and b3) with separate words (can be increased if more words)
df1 = df1 %>% separate(col = colb, c("b1","b2","b3")) 

# detect contents of columns b1, b2 or b3 in colA and create new column with logical value
df2 = df1%>%
    mutate(flag = str_detect(colA, b1)| 
                  str_detect(colA, b2)|
                  str_detect(colA, b3))

This gives output

> df2
                         colA   b1      b2   b3 flag
1 New York Metropolitan Area   New    York <NA> TRUE
2 New York Metropolitan Area  York    <NA> <NA> TRUE
3 New York Metropolitan Area   New    York Area TRUE
4 New York Metropolitan Area   Los Angeles <NA>   NA