How to extract multiple sentences from a text containing a keyword using REGEX and Stringr in R?

Time:09-14

I have this example code in R:

allfiles<-list(
    c("The Journal of Technology Transfer  45:806–819 https://doi.org/10.1007/s10961-018-9699-8 New companies and institutions, public funding incentives in the next section. Researchers, making a academic career. Unger actors, identifying KT: • Public businesses;  provision of funding for higher education;  defining for funding and planning activity. Governance of HEIs is increasingly and competition.  They profiling, i.e., institution’s funding schemes.  Several recent in design and targets. Instruments funding include elements. This, on one hand, is intermediar- ies. On the other, with universities private sector. Competitive funding from third parties has different implications depending on the source of the funds. Typically, institutional funding provides stable financial resources which can be a basis 13 814 M. Unger et all for strategic planning of long-term research activities."), 
    c("The Journal of Technology Transfer  45:806–819 https://doi.org/10.1007/s10961-018-0001-8 Performance-based and competitive funding instead for HEIs to activities. This second funding mechanism conditions for research careers.  Such schemes can improve industry. Third party partners, academic and private sectors. A well-known example of funding has both research pro- ductivity and quality. In addition case, recent  funding for research has increased research productivity. Applying the broader, research and innovation funding need to be rethought and developed further in order to provide incentives an stimulation for researchers mainly to develop the entrepreneurial uni- versity.")
)
Allsentences<-list()
for (i in 1:length(allfiles)) {
    allfiles[[i]][1] %>% 
        stringr::str_replace_all("[\r\n]" , " ") %>% 
        stringr::str_replace_all("\\s " , " ") %>% 
        stringr::str_extract_all('funding')->Allsentences[i]
}

In the "stringr::str_extract_all('funding')" line, I need to extract the sentences in which the word "funding" appears, following this pattern: the sentence starts with a capital letter, contains the word "funding", and ends with a period that is followed by a capital letter. I'm using the pattern "^(?=[A-Z]{1}). (funding). (?=.\s[A-Z]{1})", but it doesn't return what I need.

Expected Output:

Allsentences<-list(c("New companies and institutions, public funding incentives in the next section. Researchers, making a academic career.", "They profiling, i.e., institution’s funding schemes.", "Instruments funding include elements.", "Competitive funding from third parties has different implications depending on the source of the funds.", "Typically, institutional funding provides stable financial resources which can be a basis 13 814 M."), c("Performance-based and competitive funding instead for HEIs to activities.", "This second funding mechanism conditions for research careers.", "A well-known example of funding has both research pro- ductivity and quality.", "funding", 
"In addition case, recent  funding for research has increased research productivity."))

Answer:

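Rather than a greedy ".*", a negated character class should work, i.e. 'look behind' for ". ", then match a capital letter, then any number of characters that aren't ".", then the word "funding", then any number of characters that aren't ".", then, finally, ".". Because nothing inside the match can be a ".", each match stays within a single sentence.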

E.g.

library(tidyverse)

allfiles<-list(
  c("The Journal of Technology Transfer  45:806–819 https://doi.org/10.1007/s10961-018-9699-8 New companies and institutions, public funding incentives in the next section. Researchers, making a academic career. Unger actors, identifying KT: • Public businesses;  provision of funding for higher education;  defining for funding and planning activity. Governance of HEIs is increasingly and competition.  They profiling, i.e., institution’s funding schemes.  Several recent in design and targets. Instruments funding include elements. This, on one hand, is intermediar- ies. On the other, with universities private sector. Competitive funding from third parties has different implications depending on the source of the funds. Typically, institutional funding provides stable financial resources which can be a basis 13 814 M. Unger et all for strategic planning of long-term research activities."), 
  c("The Journal of Technology Transfer  45:806–819 https://doi.org/10.1007/s10961-018-0001-8 Performance-based and competitive funding instead for HEIs to activities. This second funding mechanism conditions for research careers.  Such schemes can improve industry. Third party partners, academic and private sectors. A well-known example of funding has both research pro- ductivity and quality. In addition case, recent  funding for research has increased research productivity. Applying the broader, research and innovation funding need to be rethought and developed further in order to provide incentives an stimulation for researchers mainly to develop the entrepreneurial uni- versity.")
)
Allsentences<-list()

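for (i in seq_along(allfiles)) {
  allfiles[[i]][1] %>% 
    # replace line breaks with spaces
    stringr::str_replace_all("[\r\n]" , " ") %>% 
    # collapse a whitespace character followed by a space into a single space
    stringr::str_replace_all("\\s " , " ") %>% 
    # sentence preceded by ". ", starting with a capital letter, containing "funding"
    stringr::str_extract_all("(?<=\\. )[A-Z][^\\.]*funding[^\\.]*\\.") -> Allsentences[i]
}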
Allsentences
#> [[1]]
#> [1] "Unger actors, identifying KT: • Public businesses; provision of funding for higher education; defining for funding and planning activity."
#> [2] "Instruments funding include elements."                                                                                                    
#> [3] "Competitive funding from third parties has different implications depending on the source of the funds."                                  
#> [4] "Typically, institutional funding provides stable financial resources which can be a basis 13 814 M."                                      
#> 
#> [[2]]
#> [1] "This second funding mechanism conditions for research careers."                                                                                                                                                    
#> [2] "A well-known example of funding has both research pro- ductivity and quality."                                                                                                                                     
#> [3] "In addition case, recent funding for research has increased research productivity."                                                                                                                                
#> [4] "Applying the broader, research and innovation funding need to be rethought and developed further in order to provide incentives an stimulation for researchers mainly to develop the entrepreneurial uni- versity."

Created on 2022-09-14 by the reprex package (v2.0.1)
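
Note that the first "funding" sentence of each text is missing from this output: it follows the DOI rather than ". ", so the 'look behind' cannot succeed there. The same happens to a sentence at the very start of a string. The short example below is only a sketch on a made-up string (`txt` is not from the question) showing that effect, and one possible way to also accept the start of the string as an anchor:

```r
library(stringr)

# Hypothetical toy input, not taken from the question's data.
txt <- "Public funding is stable. Private funding varies. No match here."

# Look-behind only: the first sentence has no ". " in front of it, so it is skipped.
lookbehind_only <- str_extract_all(txt, "(?<=\\. )[A-Z][^.]*funding[^.]*\\.")[[1]]
print(lookbehind_only)
#> [1] "Private funding varies."

# Accept either the start of the string or a preceding ". ".
with_start <- str_extract_all(txt, "(?:^|(?<=\\. ))[A-Z][^.]*funding[^.]*\\.")[[1]]
print(with_start)
#> [1] "Public funding is stable." "Private funding varies."
```

This would not help with the texts in the question as they stand, because the journal header and DOI come before the first sentence there.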


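Edit: After re-reading your question, you could instead drop the 'look behind' and add a positive 'look ahead' at the end for a space and another capital letter ((?= [A-Z])). This also picks up the first sentence after the DOI, but it drops a final sentence that is not followed by another one: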

library(tidyverse)

allfiles<-list(
  c("The Journal of Technology Transfer  45:806–819 https://doi.org/10.1007/s10961-018-9699-8 New companies and institutions, public funding incentives in the next section. Researchers, making a academic career. Unger actors, identifying KT: • Public businesses;  provision of funding for higher education;  defining for funding and planning activity. Governance of HEIs is increasingly and competition.  They profiling, i.e., institution’s funding schemes.  Several recent in design and targets. Instruments funding include elements. This, on one hand, is intermediar- ies. On the other, with universities private sector. Competitive funding from third parties has different implications depending on the source of the funds. Typically, institutional funding provides stable financial resources which can be a basis 13 814 M. Unger et all for strategic planning of long-term research activities."), 
  c("The Journal of Technology Transfer  45:806–819 https://doi.org/10.1007/s10961-018-0001-8 Performance-based and competitive funding instead for HEIs to activities. This second funding mechanism conditions for research careers.  Such schemes can improve industry. Third party partners, academic and private sectors. A well-known example of funding has both research pro- ductivity and quality. In addition case, recent  funding for research has increased research productivity. Applying the broader, research and innovation funding need to be rethought and developed further in order to provide incentives an stimulation for researchers mainly to develop the entrepreneurial uni- versity.")
)
Allsentences<-list()

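for (i in seq_along(allfiles)) {
  allfiles[[i]][1] %>% 
    # replace line breaks with spaces
    stringr::str_replace_all("[\r\n]" , " ") %>% 
    # collapse a whitespace character followed by a space into a single space
    stringr::str_replace_all("\\s " , " ") %>% 
    # sentence starting with a capital letter, containing "funding", followed by " " and a capital letter
    stringr::str_extract_all("[A-Z][^\\.]*funding[^\\.]*\\.(?= [A-Z])") -> Allsentences[i]
}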
Allsentences
#> [[1]]
#> [1] "New companies and institutions, public funding incentives in the next section."                                                           
#> [2] "Unger actors, identifying KT: • Public businesses; provision of funding for higher education; defining for funding and planning activity."
#> [3] "Instruments funding include elements."                                                                                                    
#> [4] "Competitive funding from third parties has different implications depending on the source of the funds."                                  
#> [5] "Typically, institutional funding provides stable financial resources which can be a basis 13 814 M."                                      
#> 
#> [[2]]
#> [1] "Performance-based and competitive funding instead for HEIs to activities."         
#> [2] "This second funding mechanism conditions for research careers."                    
#> [3] "A well-known example of funding has both research pro- ductivity and quality."     
#> [4] "In addition case, recent funding for research has increased research productivity."

Created on 2022-09-14 by the reprex package (v2.0.1)
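
Neither pattern returns the "They profiling, i.e., ..." sentence from the expected output, because "[^\\.]*" stops at the periods inside "i.e.". If that matters, one alternative is to split the text into sentences first and then keep the ones that contain the keyword. The following is only a sketch under that assumption: `extract_keyword_sentences()` is a hypothetical helper name, and the rule "a period, then whitespace, then a capital letter" is an assumed sentence boundary that will still split after abbreviations such as "M. Unger".

```r
library(stringr)

# Hypothetical helper: split into sentences, then keep those containing `keyword`.
extract_keyword_sentences <- function(text, keyword = "funding") {
  # Trim the ends and collapse runs of whitespace into single spaces.
  cleaned <- str_squish(text)
  # Assumed boundary: whitespace preceded by "." and followed by a capital letter.
  sentences <- str_split(cleaned, "(?<=\\.)\\s+(?=[A-Z])")[[1]]
  # Keep only the sentences that contain the literal keyword.
  sentences[str_detect(sentences, fixed(keyword))]
}

# Made-up input with an abbreviation inside a sentence.
demo <- "Public funding matters, i.e., it pays salaries. Nothing here. Competitive funding is different. The end."
result <- extract_keyword_sentences(demo)
print(result)
#> [1] "Public funding matters, i.e., it pays salaries."
#> [2] "Competitive funding is different."
```

Applied to the texts in the question, the first returned element would still carry the journal header and DOI, since nothing separates them from the first sentence; that prefix would have to be removed separately.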
