passage <- "Approximately 60 cases have been reported in the medical literature. Onset is in infancy."
I would like to break this passage into chunks of up to n = 3 consecutive words. That is, I would like to have the output
[1] "Approximately" "Approximately 60" "Approximately 60 cases" "60"
[5] "60 cases" "60 cases have" "cases" "cases have"
[9] "cases have been" "have" "have been" "have been reported"
[13] "been" "been reported" "been reported in" "reported"
[17] "reported in" "reported in the" "in" "in the"
[21] "in the medical" "the" "the medical" "the medical literature"
[25] "medical" "medical literature. Onset" "literature" "literature. Onset is"
[29] "Onset" "Onset is" "Onset is in" "is"
[33] "is in" "is in infancy" "in" "in infancy"
[37] "infancy"
Is there a quick way to do this in R?
CodePudding user response:
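We can split the string into words, build lead-shifted copies of the word vector with shift from data.table, and paste them together cumulatively with Reduce.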
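library(data.table)
# shift() returns lead-shifted copies of the word vector (offsets 0, 1, 2), padded with NA;
# Reduce() pastes them cumulatively into 1-, 2-, and 3-word chunks
chunks <- Reduce(function(x, y) paste(x, y, sep = " "),
  shift(strsplit(passage, " ")[[1]], n = 0:2, type = "lead"),
  accumulate = TRUE)
# rbind + c() interleaves the chunks by starting word; grep drops those padded with "NA"
out <- grep("NA\\b", c(do.call(rbind, chunks)), invert = TRUE, value = TRUE)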
-output
> out
[1] "Approximately" "Approximately 60" "Approximately 60 cases" "60" "60 cases"
[6] "60 cases have" "cases" "cases have" "cases have been" "have"
[11] "have been" "have been reported" "been" "been reported" "been reported in"
[16] "reported" "reported in" "reported in the" "in" "in the"
[21] "in the medical" "the" "the medical" "the medical literature." "medical"
[26] "medical literature." "medical literature. Onset" "literature." "literature. Onset" "literature. Onset is"
[31] "Onset" "Onset is" "Onset is in" "is" "is in"
[36] "is in infancy." "in" "in infancy." "infancy."
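One caveat with the approach above: the pattern "NA\\b" also matches any real word that ends in "NA" (for example "DNA"), so such chunks would be dropped along with the padded ones. A base R sketch that never creates the NA padding avoids this. The helper name word_chunks is hypothetical and not part of any package; like the code above, it assumes words are separated by single spaces and it keeps punctuation attached to the words.

```r
# Hypothetical helper (not from the answer above): every run of 1..n
# consecutive words, ordered by starting word, using base R only.
word_chunks <- function(text, n = 3) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  unlist(lapply(seq_along(words), function(i) {
    # Last word index reachable from position i without running off the end
    last <- min(i + n - 1, length(words))
    vapply(i:last, function(j) paste(words[i:j], collapse = " "), character(1))
  }))
}

passage <- "Approximately 60 cases have been reported in the medical literature. Onset is in infancy."
out2 <- word_chunks(passage, n = 3)
head(out2, 4)
# [1] "Approximately"          "Approximately 60"       "Approximately 60 cases" "60"
```

For this passage the result has the same 39 elements in the same order as out above. Stripping the trailing periods first (for example with gsub) is left out here because the expected output in the question keeps some of them.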