Split string into multiple two-word strings

I have a very long string (~1000 words) and I would like to split it into two-word phrases.

I have this:

string <- "A B C D E F"

and I would like this:

"A B"
"B C"
"C D"
"D E"
"E F"

The long string has already been cleaned and stemmed, and stop-words have been removed.

I tried str_split, but (I think) it needs a separator, which is awkward here: I don't want to separate A from B, only "A B" from "C D", "B C" from "D E", and so on.

CodePudding user response：

Split on spaces, then paste each word together with the one that follows it:

s <- unlist(strsplit(string, " ", fixed = TRUE))
sl <- length(s)
paste(s[1:(sl-1)], s[2:sl])
# [1] "A B" "B C" "C D" "D E" "E F"
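
For the real ~1000-word string, the same shift-and-paste idea could be wrapped in a small helper. This is only a sketch: the name bigrams is made up for illustration, and splitting on runs of whitespace (rather than on a single space) is an assumption about what the cleaned input looks like.

```r
# Hypothetical helper (not part of the original answer): shift-and-paste bigrams.
bigrams <- function(x) {
  # Assumes x is a single string; split on runs of whitespace
  w <- strsplit(trimws(x), "\\s+")[[1]]
  # With fewer than two words there are no bigrams to return
  if (length(w) < 2) return(character(0))
  # Pair every word except the last with its successor
  paste(head(w, -1), tail(w, -1))
}

bigrams("A B C D E F")
# [1] "A B" "B C" "C D" "D E" "E F"
bigrams("A")
# character(0)
```

Using head() and tail() instead of 1:(sl-1) and 2:sl avoids the surprise that 1:0 counts down in R, which matters only when the input has a single word.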

CodePudding user response：

tmp <- strsplit(string, " ")[[1]]
tmp
# [1] "A" "B" "C" "D" "E" "F"
# `:` binds tighter than `-`, so z-1:0 is z - c(1, 0), i.e. the index pair c(z-1, z)
sapply(seq_along(tmp)[-1], function(z) paste(tmp[z-1:0], collapse = " "))
# [1] "A B" "B C" "C D" "D E" "E F"
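
The sliding-window idea above extends to n-grams of any size. A minimal base-R sketch follows, assuming the words are already split into a character vector; the function name ngrams and its arguments are hypothetical, chosen only for this example.

```r
# Hypothetical generalisation: sliding window of n words over a character vector.
ngrams <- function(words, n = 2) {
  # Number of complete windows that fit in the vector
  k <- length(words) - n + 1
  if (k < 1) return(character(0))
  # vapply guarantees a character vector, even for a single window
  vapply(seq_len(k),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

tmp <- strsplit("A B C D E F", " ")[[1]]
ngrams(tmp, 2)
# [1] "A B" "B C" "C D" "D E" "E F"
ngrams(tmp, 3)
# [1] "A B C" "B C D" "C D E" "D E F"
```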

CodePudding user response：

If you already use a text-mining package (as the cleaning, stemming, and stop-word removal suggest), it most likely has a function for generating n-grams (not just bigrams), for example quanteda::tokens_ngrams() or tidytext::unnest_ngrams(). Note that unnest_ngrams() lowercases its output by default, as the second result shows:

string <- "A B C D E F"

quanteda::tokens_ngrams(quanteda::tokens(string), concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "A B" "B C" "C D" "D E" "E F"

data.frame(s = string) |>
  tidytext::unnest_ngrams(input = "s", output = "bigrams", n = 2)
#>   bigrams
#> 1     a b
#> 2     b c
#> 3     c d
#> 4     d e
#> 5     e f

Created on 2023-01-31 with reprex v2.0.2

CodePudding user response：

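Another option is a regex with a lookahead: match a word plus the whitespace after it, capture the next word inside the lookahead without consuming it, then extend each match by the length of that capture.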

string <- "A B C D E F"

. <- gregexpr("\\S+\\s+(?=(\\S+))", string, perl=TRUE)[[1]]
attr(.,"match.length") <- attr(.,"match.length") + attr(., "capture.length")
regmatches(string, list(.))[[1]]
#[1] "A B" "B C" "C D" "D E" "E F"
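
To make the lookahead approach easier to follow, here is a sketch that computes the same result with substring() instead of overwriting the match.length attribute. The variable names (m, starts, lens, out) are illustrative, and the code assumes the pattern matches at least once.

```r
string <- "A B C D E F"

# Each match consumes "word + whitespace"; the lookahead captures the next
# word without consuming it, so that word can also start the next match.
m <- gregexpr("\\S+\\s+(?=(\\S+))", string, perl = TRUE)[[1]]

starts <- as.integer(m)  # start position of each match
# Length of the consumed part plus the length of the captured next word
lens <- attr(m, "match.length") + as.integer(attr(m, "capture.length"))

out <- substring(string, starts, starts + lens - 1)
out
# [1] "A B" "B C" "C D" "D E" "E F"
```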