I am trying to filter out relevant rowas based on the presence or existence of a string or part/element of a string in R. Following is the example:
colA colb flag
New York Metropolitan Area New York Yes
New York Metropolitan Area York Yes
New York Metropolitan Area New York Area Yes
New York Metropolitan Area Los Angeles No
Things I have tried till now:
- Where 2 different dataframes are present
df1<- df1 %>% fuzzy_inner_join(df2, by = c("colA" = "colB"), match_fun = str_detect)
This option fails due to paranthesis and other special characters, cleaning them all up also did not help.
- I joined the 2 dataframes based on an upper level hierarchay to limit the rows and created a dataframe df
df[, "lookup"] <- gsub(" ", "|", df[,"colB"])
df[,"flag"] <- mapply(grepl, df[,"lookup"], df[,"colA"])
Results not satisfactory as only limted rows are filtered.
Thank you in advance.
CodePudding user response:
Here is a base R solution.
The anonymous lambda function \(x, y)
was introduced in R 4.1.0, for older versions of R use function(x, y)
.
pattern <- gsub(" ", "|", df1$colb)
i <- mapply(\(x, y)grepl(x, y), pattern, df1$colA)
df1$flag <- c("No", "Yes")[i 1L]
df1
# colA colb flag
#1 New York Metropolitan Area New York Yes
#2 New York Metropolitan Area York Yes
#3 New York Metropolitan Area New York Area Yes
#4 New York Metropolitan Area Los Angeles No
To remove the rows not matching the patterns:
df1[i, ]
# colA colb flag
#1 New York Metropolitan Area New York Yes
#2 New York Metropolitan Area York Yes
#3 New York Metropolitan Area New York Area Yes
Data
df1 <-
structure(list(colA = c("New York Metropolitan Area",
"New York Metropolitan Area", "New York Metropolitan Area",
"New York Metropolitan Area"), colb = c("New York", "York",
"New York Area", "Los Angeles"), flag = c("Yes", "Yes", "Yes",
"No")), row.names = c(NA, -4L), class = "data.frame")
CodePudding user response:
If I've understood your question correctly, you're trying to match partial strings and get a new column indicating the match:
df1 <- data.frame(colA = rep("New York Metropolitan Area ", 4),
colb = c("New York", "York", "New York Area", "Los Angeles") )
My first attempt was a simple str_detect
but this tries to match the whole string in colb
in colA
:
df3 = df1%>%
mutate(flag = str_detect(colA, colb))
> df3
colA colb flag
1 New York Metropolitan Area New York TRUE
2 New York Metropolitan Area York TRUE
3 New York Metropolitan Area New York Area FALSE
4 New York Metropolitan Area Los Angeles FALSE
This is not quite right; though in this example you could just add df1$colb = gsub("Area", "", df1$colb )
first.
alternatively:
library(dplyr) # for pipe
library(stringr) # for str_detect
library(tidyr) # for separate
#separate colb into 3 columns (called b1,b2 and b3) with separate words (can be increased if more words)
df1 = df1 %>% separate(col = colb, c("b1","b2","b3"))
# detect contents of columns b1, b2 or b3 in colA and create new column with logical value
df2 = df1%>%
mutate(flag = str_detect(colA, b1)|
str_detect(colA, b2)|
str_detect(colA, b3))
This gives output
> df2
colA b1 b2 b3 flag
1 New York Metropolitan Area New York <NA> TRUE
2 New York Metropolitan Area York <NA> <NA> TRUE
3 New York Metropolitan Area New York Area TRUE
4 New York Metropolitan Area Los Angeles <NA> NA