I have two data sources / columns A & B, both having company names, possibly in a slightly different format. The data sources are of different sizes. For example:
A:
Company |
---|
Factory |
Shop |
Restaurant |
B:
Company |
---|
Factory Inc |
Shop Ltd |
Service |
Laundry |
How could I look up row-wise, if A company exists, perhaps as a substring in a longer column in the data set B?
CodePudding user response:
One approach is to use sapply
and str_detect
sapply(df1$Company, function(x) str_detect(df2$Company, x))
Factory Shop Restaurant
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
CodePudding user response:
If you want to know which element of A is contained within which element of B, you can use the base function grepl
with option fixed = TRUE
. This option avoids unexpected behaviour if your text contains characters that are reserved in regular expressions, such as .
, ?
, (
or similar.
A <- c("Factory", "Shop", "Restaurant")
B <- c("Factory Inc", "Shop Ltd", "Service", "Laundry")
sapply(A, \(string) grepl(string, B, fixed = TRUE))
Factory Shop Restaurant
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
However, if you only want to know if a row in A is present anywhere in B, you can add any
to the expression...
sapply(A, \(string) any(grepl(string, B, fixed = TRUE)))
Factory Shop Restaurant
TRUE TRUE FALSE
...or you can "cheat" by turning B into one big string (potentially using unique
to account for duplication). Use a separator that would not be present elsewhere to avoid creating new "words" at the boundaries.
Bcheck <- paste(unique(B), collapse="|")
Bcheck
[1] "Factory Inc|Shop Ltd|Service|Laundry"
sapply(A, \(string) grepl(string, Bcheck, fixed = TRUE))
Factory Shop Restaurant
TRUE TRUE FALSE
The latter is a bit faster according to microbenchmark.