I have a very long string (~1000 words) and I would like to split it into overlapping two-word phrases.
I have this:
string <- "A B C D E F"
and I would like this:
"A B"
"B C"
"C D"
"D E"
"E F"
The long string has already been cleaned and stemmed, and stop-words have been removed.
I tried str_split(), but (I think) it needs a separator, which is complicated here: I don't want to separate A from B; I want every overlapping pair, so "A B", "B C", "C D", and so on.
Answer:
Split on spaces, then paste the token vector against a copy of itself shifted by one position:
s <- unlist(strsplit(string, " ", fixed = TRUE))
sl <- length(s)
paste(s[1:(sl-1)], s[2:sl])
# [1] "A B" "B C" "C D" "D E" "E F"
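The same split-then-shift idea can also be written with base R's embed(), which builds the consecutive pairs directly (a sketch, not from the original answer; note embed() returns the columns in reverse order):

```r
s <- unlist(strsplit("A B C D E F", " ", fixed = TRUE))
pairs <- embed(s, 2)          # one row per bigram, columns in reverse order
paste(pairs[, 2], pairs[, 1])
# [1] "A B" "B C" "C D" "D E" "E F"
```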
Answer:
tmp <- strsplit(string, " ")[[1]]
tmp
# [1] "A" "B" "C" "D" "E" "F"
sapply(seq_along(tmp)[-1], function(z) paste(tmp[z - 1:0], collapse = " "))  # z - 1:0 is c(z - 1, z)
# [1] "A B" "B C" "C D" "D E" "E F"
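This pattern generalizes easily beyond bigrams. Here is a small hypothetical helper (not part of the original answer) that produces n-grams of any size with the same indexing trick:

```r
# Hypothetical helper: overlapping n-grams from a whitespace-separated string
ngrams <- function(x, n = 2) {
  tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
ngrams("A B C D E F", 3)
# [1] "A B C" "B C D" "C D E" "D E F"
```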
Answer:
If you already use a text mining package (as "cleaned, stemmed and stop-words removed" would suggest), it most likely offers something to generate n-grams (and not just bigrams), for example quanteda::tokens_ngrams() or tidytext::unnest_ngrams():
string <- "A B C D E F"
quanteda::tokens_ngrams(quanteda::tokens(string), concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "A B" "B C" "C D" "D E" "E F"
data.frame(s = string) |>
tidytext::unnest_ngrams(input = "s", output = "bigrams", n = 2)
#> bigrams
#> 1 a b
#> 2 b c
#> 3 c d
#> 4 d e
#> 5 e f
Created on 2023-01-31 with reprex v2.0.2
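Since these functions handle n-grams of any length, trigrams are a one-argument change. A sketch with quanteda (assuming the same `string` as above; the exact print layout of the tokens object may differ by version):

```r
string <- "A B C D E F"
# n = 3 yields overlapping trigrams instead of bigrams
quanteda::tokens_ngrams(quanteda::tokens(string), n = 3, concatenator = " ")
# contains "A B C" "B C D" "C D E" "D E F"
```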
Answer:
An option would be to use a regex with a lookahead.
string <- "A B C D E F"
. <- gregexpr("\\S+\\s+(?=(\\S+))", string, perl = TRUE)[[1]]
attr(., "match.length") <- attr(., "match.length") + attr(., "capture.length")
regmatches(string, list(.))[[1]]
#[1] "A B" "B C" "C D" "D E" "E F"
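As a quick sanity check (not part of the original answer), the lookahead result can be compared against the plain split-and-shift output; the two approaches should agree on single-space input:

```r
string <- "A B C D E F"
# each match is "token + whitespace", then extended by the captured next token
m <- gregexpr("\\S+\\s+(?=(\\S+))", string, perl = TRUE)[[1]]
attr(m, "match.length") <- attr(m, "match.length") + attr(m, "capture.length")
bigrams <- regmatches(string, list(m))[[1]]

# cross-check against the split-and-shift approach
s <- strsplit(string, " ", fixed = TRUE)[[1]]
identical(bigrams, paste(s[-length(s)], s[-1]))
# [1] TRUE
```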