How to chunk a sentence into up to n-word strings?

Time:01-31

passage <- "Approximately 60 cases have been reported in the medical literature. Onset is in infancy." 

I would like to break this passage up into chunks of up to n = 3 consecutive words. That is, I would like the output to be:

 [1] "Approximately"             "Approximately 60"          "Approximately 60 cases"    "60"                       
 [5] "60 cases"                  "60 cases have"             "cases"                     "cases have"               
 [9] "cases have been"           "have"                      "have been"                 "have been reported"       
[13] "been"                      "been reported"             "been reported in"          "reported"                 
[17] "reported in"               "reported in the"           "in"                        "in the"                   
[21] "in the medical"            "the"                       "the medical"               "the medical literature"   
[25] "medical"                   "medical literature. Onset" "literature"                "literature. Onset is"     
[29] "Onset"                     "Onset is"                  "Onset is in"               "is"                       
[33] "is in"                     "is in infancy"             "in"                        "in infancy"               
[37] "infancy"

Is there a quick way to do this in R?

CodePudding user response:

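One option is to split the passage into words with strsplit, build lead-shifted copies of the word vector with data.table's shift, and paste them together cumulatively with Reduce(accumulate = TRUE). Chunks that run past the end of the passage contain NA and are filtered out with grep.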

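library(data.table)

words <- strsplit(passage, " ")[[1]]
# shift() returns the words plus copies led by 1 and 2 positions, padded with NA;
# Reduce(accumulate = TRUE) pastes them into 1-, 2- and 3-word chunks
chunks <- Reduce(function(x, y) paste(x, y, sep = " "),
                 shift(words, n = 0:2, type = "lead"), accumulate = TRUE)
# rbind + c() interleaves the chunks by start word; drop chunks padded with "NA"
# (caveat: this filter also drops any real word ending in "NA", e.g. "DNA")
out <- grep("NA\\b", c(do.call(rbind, chunks)), invert = TRUE, value = TRUE)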

-output

> out
 [1] "Approximately"             "Approximately 60"          "Approximately 60 cases"    "60"                        "60 cases"                 
 [6] "60 cases have"             "cases"                     "cases have"                "cases have been"           "have"                     
[11] "have been"                 "have been reported"        "been"                      "been reported"             "been reported in"         
[16] "reported"                  "reported in"               "reported in the"           "in"                        "in the"                   
[21] "in the medical"            "the"                       "the medical"               "the medical literature."   "medical"                  
[26] "medical literature."       "medical literature. Onset" "literature."               "literature. Onset"         "literature. Onset is"     
[31] "Onset"                     "Onset is"                  "Onset is in"               "is"                        "is in"                    
[36] "is in infancy."            "in"                        "in infancy."               "infancy."     
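
For comparison, here is a minimal base R sketch of the same idea that needs no extra package and never creates NA-padded chunks, so no grep filter is required. The helper name chunk_words and its arguments are hypothetical, chosen only for illustration; it assumes words are separated by single spaces, as in the passage above.

```r
# Hypothetical helper (not part of the answer above):
# returns every run of 1..n consecutive words, ordered by start word.
chunk_words <- function(text, n = 3) {
  # Assumes single-space separators between words
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  unlist(lapply(seq_along(words), function(i) {
    # Last word index reachable from position i without running off the end
    last <- min(i + n - 1, length(words))
    # Build the chunks words[i..i], words[i..i+1], ..., words[i..last]
    vapply(i:last, function(j) paste(words[i:j], collapse = " "), character(1))
  }))
}

passage <- "Approximately 60 cases have been reported in the medical literature. Onset is in infancy."
out <- chunk_words(passage, n = 3)
```

As in the answer above, punctuation stays attached to the words. If that is not wanted, one option is to strip it before chunking, for example with gsub("[[:punct:]]", "", passage).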