I am checking whether the brands in the data frame "df1"
brands
1 Nike
2 Adidas
3 D&G
occur in the entries of the following column of the data frame "df2"
statements
1 I love Nike
2 I don't like Adidas
3 I hate Puma
For this I use the code:
subset_df2 <- df2[grepl(paste(df1$brands, collapse = "|"), df2$statements, ignore.case = TRUE), ]
The code works and I get a subset of df2 containing only the rows that mention one of the brands:
statements
1 I love Nike
2 I don't like Adidas
Is there also a way to show which part of each entry in df2$statements actually matched one of the df1$brands? For instance, a vector like [Nike, Adidas]. So I only want the matched brand names (Nike and Adidas) as my output, not the whole statement.
Many thanks in advance!
Answer:
brands <- c("nike", "adidas", "d&g") # lower-case here
text <- c("I love Nike", "I love Adidas")
ptns <- paste(brands, collapse = "|")
ptns
# [1] "nike|adidas|d&g"
text2 <- text[NA]
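hit <- grepl(ptns, text, ignore.case = TRUE) # which statements contain a brand
text2[hit] <- gsub(paste0(".*(", ptns, ").*"), "\\1", text[hit], ignore.case = TRUE) # substitute only within the matching statements so lengths agree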
text2
# [1] "Nike" "Adidas"
The pre-assignment of text[NA] is needed because gsub returns its input unchanged when the pattern is not found, so a statement without any brand would otherwise keep its full text instead of becoming NA. I'm using text[NA], but rep(NA_character_, length(text)) has the same effect.
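As a small check of that behaviour, here is a sketch that applies the same idea to the data from the question, where the third statement mentions no brand. The data frames df1 and df2 are rebuilt from the question's printed output, so how they are constructed is an assumption, and the names hit and matched are only illustrative.

```r
# Rebuild the question's data (construction assumed from the printed output)
df1 <- data.frame(brands = c("Nike", "Adidas", "D&G"), stringsAsFactors = FALSE)
df2 <- data.frame(statements = c("I love Nike", "I don't like Adidas", "I hate Puma"),
                  stringsAsFactors = FALSE)

ptns <- paste(df1$brands, collapse = "|")
hit  <- grepl(ptns, df2$statements, ignore.case = TRUE)

# Start with all NA, then fill in only the rows that contain a brand
matched <- rep(NA_character_, nrow(df2))
matched[hit] <- gsub(paste0(".*(", ptns, ").*"), "\\1",
                     df2$statements[hit], ignore.case = TRUE)
matched
# [1] "Nike"   "Adidas" NA
```

If only the brands that were found are wanted, matched[!is.na(matched)] drops the NA entries.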
If you need multiple matches per text, then perhaps
brands <- c("Nike", "Adidas", "d&g")
text <- c("I love nike", "I love Adidas and Nike")
ptns <- paste(brands, collapse = "|")
gre <- gregexpr(ptns, text, ignore.case = TRUE)
sapply(regmatches(text, gre), paste, collapse = ";")
# [1] "nike" "Adidas;Nike"
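The question asks for a single vector such as [Nike, Adidas] rather than one string per statement. A possible sketch, building on the gregexpr approach: flatten all matches with unlist, then map each hit back to the spelling used in brands so that "nike" and "Nike" count as the same brand. The names found and canonical are illustrative, and the case-insensitive lookup via tolower is an assumption about what counts as a duplicate.

```r
brands <- c("Nike", "Adidas", "d&g")
text <- c("I love nike", "I love Adidas and Nike")
ptns <- paste(brands, collapse = "|")

# All matches across all statements, as one flat character vector
found <- unlist(regmatches(text, gregexpr(ptns, text, ignore.case = TRUE)))
found
# [1] "nike"   "Adidas" "Nike"

# Map each hit back to the spelling used in `brands`, then drop duplicates
canonical <- unique(brands[match(tolower(found), tolower(brands))])
canonical
# [1] "Nike"   "Adidas"
```

Note that the brand names are used as a regular expression here; a name containing regex metacharacters such as "." or "+" would need escaping before being pasted into the pattern.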