Extract sentences from texts in data frame

I have a data frame with a column "text". In each row, "text" contains several sentences (maybe only two, maybe 100 or more). I would like to search the text in every row for specific keywords. If a keyword is found in a row's text, I would like to extract the sentences that contain keywords into a separate column, for example:

needles = c("first", "hope", "analyze", "happy")

mydata <- data.frame(
  text = c("This is the first sentence. It is the beginning of this project",
           "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
           "And this is the last sentence. Finally my work ends. I am really happy about that.",
           "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."),
  findings = c("This is the first sentence.",
               "I hope this project will work fine. Then I will analyze the third sentence.",
               "I am really happy about that.",
               NA)
)

So column "text" contains the sentences I want to check for keywords, and "findings" is the result I would like to have in the end.

Can anyone help me apply this to all rows of the data frame? Thank you!

CodePudding user response：

What about something like this:

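library(dplyr)
library(tidyr)
library(stringr)

# Return the sentences of `text` that contain `word`, wrapped in a list
# so that the result can be stored in a list column.
find_sentence <- function(text, word){
  # split on a period followed by whitespace
  x <- c(str_split(text, "\\.\\s+", simplify=TRUE))
  inds <- which(str_detect(x, word))
  if(length(inds) > 0){
    list(x[inds])
  }else{
    list(NA_character_)
  }
}

mydata %>% 
  rowwise() %>% 
  mutate(res = find_sentence(text, "the")) %>% 
  unnest(res)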

# # A tibble: 4 × 3
#   text                                                                                                    findings                     res            
#   <chr>                                                                                                   <chr>                        <chr>          
# 1 This is the first sentence. It is the beginning of this project                                         This is the first sentence.  This is the fi…
# 2 This is the first sentence. It is the beginning of this project                                         This is the first sentence.  It is the begi…
# 3 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence. I hope this project will wo… Then I will an…
# 4 And this is the last sentence. Finally my work ends. I am really happy about that.                      I am really happy about tha… And this is th…

This returns a new variable called res that has a separate row for each sentence containing the keyword. So, if two sentences in the same text contain the word (as in the first row of text), the text and findings columns are repeated for each of the matching sentences in res.
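If one row per original text is preferred, as in the "findings" column of the question, the matching sentences can be pasted back together instead of unnested. The following is only a sketch under a few assumptions: collapse_matches is a hypothetical helper name, the split keeps each period attached to its sentence, and the keywords are matched as whole words.

```r
library(stringr)

needles <- c("first", "hope", "analyze", "happy")

texts <- c(
  "This is the first sentence. It is the beginning of this project",
  "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
  "And this is the last sentence. Finally my work ends. I am really happy about that.",
  "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
)

# Hypothetical helper: returns the matching sentences of one text as a
# single string, or NA when no sentence contains any of the words.
collapse_matches <- function(text, words) {
  # Split on whitespace that follows a period, so each sentence keeps its period.
  sentences <- str_split(text, "(?<=\\.)\\s+")[[1]]
  # \\b restricts the match to whole words.
  pattern <- str_c("\\b(", str_c(words, collapse = "|"), ")\\b")
  hits <- sentences[str_detect(sentences, pattern)]
  if (length(hits) == 0) NA_character_ else str_c(hits, collapse = " ")
}

# One result per text, in the same order as the input.
res <- vapply(texts, collapse_matches, character(1),
              words = needles, USE.NAMES = FALSE)
```

Because res has one element per text, it could be assigned directly as a new column, for example mydata$res <- res.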

CodePudding user response：

With Base R,

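# Split each text into sentences at the periods
lookup <- strsplit(as.character(mydata$text), "\\.")

# Combine all needles into a single alternation pattern
pattern <- paste0(needles, collapse = "|")

out <- lapply(lookup, function(x) {
  # keep the sentences that contain any needle and glue them back together
  logic <- grepl(pattern, x)
  paste0(x[logic], collapse = ".")
})

data.frame(findings = do.call(rbind, out))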

gives,

#                                                                     findings
#1                                                  This is the first sentence
#2  I hope this project will work fine. Then I will analyze the third sentence
#3                                                I am really happy about that
#4
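Note that splitting on "\\." leaves a leading space on every sentence after the first, drops the final period, and yields an empty string rather than NA when nothing matches. A possible base R variant that avoids these points is sketched below. It assumes that sentences end in a period followed by whitespace and that the keywords should match as whole words. The names texts and pattern are introduced here for illustration only.

```r
needles <- c("first", "hope", "analyze", "happy")

texts <- c(
  "This is the first sentence. It is the beginning of this project",
  "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
  "And this is the last sentence. Finally my work ends. I am really happy about that.",
  "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
)

# Whole-word alternation over all needles.
pattern <- paste0("\\b(", paste(needles, collapse = "|"), ")\\b")

# Split on whitespace that follows a period; perl = TRUE enables the lookbehind,
# so the period stays attached to its sentence.
sentences <- strsplit(texts, "(?<=\\.)\\s+", perl = TRUE)

findings <- vapply(sentences, function(s) {
  hits <- s[grepl(pattern, s, perl = TRUE)]
  # NA when no sentence matches, otherwise the matches joined by a space.
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = " ")
}, character(1))
```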

CodePudding user response：

This uses grep and strsplit to get the matches.

# split only the text column; t(mydata) would also include the findings column
mydata$findings <- sapply( strsplit( as.character(mydata$text), "\\. " ), function(x)
                     x[unlist( lapply( needles, function(y) grep(y, x) ) )] )

                                                                                                     text
1                                         This is the first sentence. It is the beginning of this project
2 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.
3                      And this is the last sentence. Finally my work ends. I am really happy about that.
4   These sentences do not contain any relevant information. There is no keyword. And it is not relevant.
                                                                     findings
1                                                  This is the first sentence
2 I hope this project will work fine, Then I will analyze the third sentence.
3                                               I am really happy about that.
4

CodePudding user response：

We can work with a nested list: split each row of the text column into sentences, then look for the needles inside each sentence of each row.

The reduce calls flatten the nested lists one level at a time.

code:

library(tidyverse)


needles <- c("first", "hope", "analyze", "happy")

mydata <- data.frame(
  text = c(
    "This is the first sentence. It is the beginning of this project",
    "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
    "And this is the last sentence. Finally my work ends. I am really happy about that.",
    "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
  ),
  findings = c(
    "This is the first sentence.",
    "I hope this project will work fine. Then I will analyze the third sentence.",
    "I am really happy about that.",
    NA
  )
)


(map(mydata$text, ~ str_split(., "\\.\\s")) %>%
  map_depth(2, function(row) map(needles, ~ str_subset(row, .))) %>%
  map_depth(2, ~ reduce(., c)) %>%
  map(~ reduce(., c)) %>%
  map_if(~ length(.) > 1, ~ reduce(., paste, sep = ". ")) %>%
  reduce(c) -> findings)
#> [1] "This is the first sentence"                                                 
#> [2] "I hope this project will work fine. Then I will analyze the third sentence."
#> [3] "I am really happy about that."

Created on 2021-11-26 by the reprex package (v2.0.1)