Return string pattern match plus text before and after pattern

Time:10-03

Suppose I have diary entries from 5 people and I want to determine if they mention any food-related key words. I want an output of the key word with a window of one word before and after to provide context before determining if they are food-related.

The search should be case-insensitive, and it's OK if the key word is embedded in another word. E.g., if a key word is "rice", I want the output to include "price".

Assume I have the following data:

foods <- c('corn', 'hot dog', 'ham', 'rice')
df <- data.frame(id = 1:5,
                 diary = c('I ate rice and corn today',
                            'Sue ate my corn.',
                            'He just hammed it up',
                            'Corny jokes are my fave',
                            'What is the price of milk'))

The output I'm looking for is:

|ID|Output                          |
|--|--------------------------------|
|1 |"ate rice and", "and corn today"|
|2 |"my corn"                       |
|3 |"just hammed it"                |
|4 |"Corny jokes"                   |
|5 |"the price of"                  |

I've used stringi::stri_detect, but the output includes the entire diary entry.

I've used stringi::stri_extract, but I can't find a way to include one word before and after the key word.

CodePudding user response:

We can collapse the key words into a single regex alternation and extract the words ("\w+") that precede or follow the collapsed pattern. The regex() function accepts the argument ignore_case = TRUE, which makes the match case-insensitive. We may have to allow optional word characters (\w*) inside the word boundaries around the collapsed pattern, so that both rice and price, and both ham and hammed, are included. I made some small changes to the data to make it more illustrative.

I am posting two solutions. The first excludes matches inside larger words, such as "hammed" or "price", so non-food matches return empty strings. The second is more inclusive.
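To make the two variants concrete, here is a small base R sketch that builds both collapsed alternations and prints them. The names `strict` and `inclusive` are only illustrative and do not appear in the solutions; `grepl()` with `perl = TRUE` stands in for stringr here so the snippet has no dependencies.

```r
foods <- c("corn", "hot dog", "ham", "rice")

# Strict alternation: each key word must stand alone as a whole word.
strict <- paste0("\\b", foods, "\\b", collapse = "|")

# Inclusive alternation: the key word may sit inside a longer word.
inclusive <- paste0("\\b\\w*", foods, "\\w*\\b", collapse = "|")

cat(strict, "\n")
cat(inclusive, "\n")

# "price" only matches the inclusive variant.
strict_hit    <- grepl(strict, "What is the price of milk", ignore.case = TRUE, perl = TRUE)
inclusive_hit <- grepl(inclusive, "What is the price of milk", ignore.case = TRUE, perl = TRUE)
print(c(strict = strict_hit, inclusive = inclusive_hit))
```

The strict pattern rejects "price" because there is no word boundary between "p" and "rice"; the inclusive pattern accepts it because \w* absorbs the leading "p".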

library(dplyr)
library(stringr)

df %>% mutate(Output = str_extract_all (diary,
                                        regex(paste0("\\w+\\s+(",
                                                     paste("\\b",foods, "\\b", collapse = "|", sep=''),
                                                     ")\\s+\\w+"),
                                              ignore_case=TRUE)))

output 1

  id                        diary             Output
1  1    I ate rice and corn today       ate rice and
2  2             Sue ate my corn.                  
3  3         He just hammed it up                  
4  4      Corny jokes are my fave                  
5  5    What is the price of milk                  
6  6 I like to eat ham sandwiches eat ham sandwiches

solution 2

df %>% mutate(Output = str_extract_all (diary,
                                        regex(paste0("\\w+\\s+(",
                                                     paste("\\b\\w*",foods, "\\w*\\b", collapse = "|", sep=''),
                                                     ")\\s+\\w+"),
                                              ignore_case=TRUE)))

  id                        diary             Output
1  1    I ate rice and corn today       ate rice and
2  2             Sue ate my corn.                  
3  3         He just hammed it up     just hammed it
4  4      Corny jokes are my fave                  
5  5    What is the price of milk       the price of
6  6 I like to eat ham sandwiches eat ham sandwiches

data

foods <- c('corn', 'hot dog', 'ham', 'rice')
df <- data.frame(id = 1:6,
                 diary = c('I ate rice and corn today',
                           'Sue ate my corn.',
                           'He just hammed it up',
                           'Corny jokes are my fave',
                           'What is the price of milk',
                           'I like to eat ham sandwiches'))

FINAL EDIT

I figured out the problem with "corn" and handled the multiple-matches issue. We need a nested loop: the outer loop goes through all entries in "diary", and the inner loop goes through all "foods", calling "str_extract_all" with the appropriate regex each time. The initial regex required a food word to be both preceded and followed by another word, so foods at sentence boundaries were not matched. I wrapped the surrounding words in optional groups with the ? quantifier (0 or 1 matches), i.e. (\\w+\\s+)? and (\\s+\\w+)?, so those cases now match. The one remaining issue is the order of the results when an entry has multiple matches: they follow the order of "foods", not the order in the text. But I think the solution is fine now.

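library(purrr)  # provides map(); dplyr and stringr are loaded above

# Outer map: one diary entry at a time; inner map: one food at a time.
df %>% mutate(output=map(df$diary,
                         ~map(foods, \(x) str_extract_all(.x,
                                                          regex(paste0("(\\w+\\s+)?(",
                                                                       paste("\\b\\w*", x, "\\w*\\b", collapse = "|", sep=''),
                                                                       ")(\\s+\\w+)?"),
                                                                ignore_case=TRUE))))%>%
                      map(unlist))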

  id                        diary                       output
1  1    I ate rice and corn today and corn today, ate rice and
2  2             Sue ate my corn.                      my corn
3  3         He just hammed it up               just hammed it
4  4      Corny jokes are my fave                  Corny jokes
5  5    What is the price of milk                 the price of
6  6 I like to eat ham sandwiches           eat ham sandwiches
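If the order of multiple matches matters, one option is to record where each match starts and sort on that. The following is a dependency-free base R sketch of the same per-food regex. The helper name `keyword_context` is hypothetical, and using `gregexpr()`/`regmatches()` instead of stringr is my substitution, not part of the solution above.

```r
foods <- c("corn", "hot dog", "ham", "rice")

# Hypothetical helper: one-word context window around each key word hit,
# returned in the order the hits appear in the text.
keyword_context <- function(text, keywords) {
  hits <- lapply(keywords, function(k) {
    # Optional word before, key word possibly embedded in a longer word,
    # optional word after.
    pattern <- paste0("(\\w+\\s+)?\\b\\w*", k, "\\w*\\b(\\s+\\w+)?")
    g <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
    if (g[[1]][1] == -1) return(NULL)  # no match for this key word
    data.frame(start = as.integer(g[[1]]),
               match = regmatches(text, g)[[1]],
               stringsAsFactors = FALSE)
  })
  hits <- Filter(Negate(is.null), hits)
  if (length(hits) == 0) return(character(0))
  hits <- do.call(rbind, hits)
  # Sort by start position so results follow the text, not the key word list.
  hits$match[order(hits$start)]
}

print(keyword_context("I ate rice and corn today", foods))
print(keyword_context("Sue ate my corn.", foods))
```

Because each key word is searched separately, overlapping windows such as "ate rice and" and "and corn today" are both kept, and the sort only changes their order.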

CodePudding user response:

Not entirely sure whether that's 100% helpful but worth a try:

First, define your keywords as a case-insensitive alternation pattern:

patt <- paste0("(?i)(", paste0(foods, collapse = "|"), ")")

Then extract the word on the left, the keyword itself (called node), and the word on the right, using stringr's function str_extract_all:

library(stringr)
df1 <- data.frame(
  left = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), paste0("(?i)(\\S+|^)(?=\\s?", patt, ")"))),
  node = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), patt)),
  right = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), paste0("(?<=\\s?", patt, "\\s?)(\\S+|$)")))
  )

Result:

df1
  left node right
1  ate rice   and
2  and corn today
3   my corn      
4 just  ham   med
5      Corn     y
6    p rice    of

While this is not exactly the expected output, it may still serve your purpose if that purpose is to check whether a match is indeed a keyword. In rows 5 and 6 of df1, for example, the view immediately makes it clear that these are not keyword matches.
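To see how the lookahead behind the left column works in isolation, here is a minimal base R sketch on a single sentence. It uses a simplified pattern (a mandatory space and only two key words), which is my assumption for illustration and not the exact pattern from this answer.

```r
x <- "I ate rice and corn today"

# \S+ grabs a run of non-space characters, but only where the lookahead
# (?=...) sees a space followed by a key word. The lookahead itself
# consumes nothing, so the key word is not part of the match.
pattern <- "\\S+(?=\\s(corn|rice))"

left <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(left)
```

The lookbehind used for the right column works the same way in the other direction, although a variable-length lookbehind like the one in this answer relies on stringr's ICU engine.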
