I have a data frame with a text column and a column holding a word count.
How can I delete the last n words of the text (specified in the 'wc' column) and save the output to a third column?
In other words, I need the "introductory" part of a bunch of texts, and I know when the intro ends, so I want to cut the text off at that point and save the intro in a new column.
df <- data.frame(text = c("this is a long text", "this is also a long text", "another long text"), wc = c('1', '2', '1'))
Desired output:
text | wc | chopped_off_text |
---|---|---|
this is a long text | 1 | this is a long |
this is also a long text | 2 | this is also a |
another long text | 1 | another long |
CodePudding user response:
library(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
df <- data.frame(text = c("this is a long text",
                          "this is also a long text",
                          "another long text"),
                 wc = c(1, 2, 1))
corp <- corpus(df)
toks <- tokens(corp)
tokens_select(toks, startpos = rep(1, ndoc(toks)), endpos = ntoken(toks) - toks$wc)
#> Tokens consisting of 3 documents and 1 docvar.
#> text1 :
#> [1] "this" "is" "a" "long"
#>
#> text2 :
#> [1] "this" "is" "also" "a"
#>
#> text3 :
#> [1] "another" "long"
Created on 2022-06-04 by the reprex package (v2.0.1)
CodePudding user response:
You can use the word() function from the stringr package to extract "words" from a sentence. str_count(text, "\\s") + 1 counts the number of words in the sentence, assuming the words are separated by single whitespace characters.
library(stringr)
library(dplyr)
df %>%
  mutate(chopped_off_text =
           word(text, 1, end = str_count(text, "\\s") + 1 - as.integer(wc)))
text wc chopped_off_text
1 this is a long text 1 this is a long
2 this is also a long text 2 this is also a
3 another long text 1 another long
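If you would rather avoid extra packages, the same idea can be sketched in base R: split each text on whitespace, keep all but the last wc words, and paste them back together. This is only a minimal sketch, not something taken from either package above; chop_last_words() is a hypothetical helper name, and it assumes words are separated by whitespace and that wc can be converted to integers.

```r
# Example data from the question (wc stored as character, as in the original).
df <- data.frame(
  text = c("this is a long text", "this is also a long text", "another long text"),
  wc = c("1", "2", "1"),
  stringsAsFactors = FALSE
)

# chop_last_words() is a hypothetical helper, not part of any package.
# It drops the last n[i] whitespace-separated words from text[i].
chop_last_words <- function(text, n) {
  words <- strsplit(as.character(text), "\\s+")  # one vector of words per row
  mapply(function(w, k) {
    keep <- max(length(w) - k, 0)                # never request a negative count
    paste(head(w, keep), collapse = " ")
  }, words, as.integer(n), USE.NAMES = FALSE)
}

df$chopped_off_text <- chop_last_words(df$text, df$wc)
df
```

If wc is larger than the number of words in a row, this sketch returns an empty string for that row rather than an error.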