In R, chop off column after n words

I have a df with a text column, and a column with a wordcount value.

How can I delete the last n words of the text (specified in the 'wc' column) and save the output to a third column?

In other words, I need the "introductory" part of a bunch of texts, and I know when the intro ends, so I want to cut the text off at that point and save the intro in a new column.

df <- data.frame(text = c("this is a long text","this is also a long text", "another long text"),wc=c('1','2','1'))

Desired output:

text	wc	chopped_off_text
this is a long text	1	this is a long
this is also a long text	2	this is also a
another long text	1	another long

CodePudding user response：

require(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.

df <- data.frame(text = c("this is a long text", 
                          "this is also a long text", 
                          "another long text"), 
                 wc = c(1, 2, 1))
corp <- corpus(df)
toks <- tokens(corp)
tokens_select(toks, startpos = rep(1, ndoc(toks)), endpos = ntoken(toks) - toks$wc)
#> Tokens consisting of 3 documents and 1 docvar.
#> text1 :
#> [1] "this" "is"   "a"    "long"
#> 
#> text2 :
#> [1] "this" "is"   "also" "a"   
#> 
#> text3 :
#> [1] "another" "long"

Created on 2022-06-04 by the reprex package (v2.0.1)
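The result above is a tokens object, not a column in df. One way to get the third column the question asks for is to paste each document's tokens back together. This is only a sketch: it reuses the tokens_select() call from the reprex (run there under quanteda 3.2.1), and the names trimmed and chopped_off_text are chosen here for illustration. Note that tokens() splits punctuation into separate tokens by default, so on real text the token count may not match a whitespace-based word count, and pasting with single spaces will not restore the original spacing around punctuation.

```r
library(quanteda)

df <- data.frame(text = c("this is a long text",
                          "this is also a long text",
                          "another long text"),
                 wc = c(1, 2, 1))

toks <- tokens(corpus(df))

# Same call as in the reprex: keep tokens from position 1 up to
# (number of tokens - wc) in each document.
trimmed <- tokens_select(toks,
                         startpos = rep(1, ndoc(toks)),
                         endpos = ntoken(toks) - toks$wc)

# Collapse each document's remaining tokens into one string and
# store the result as a new column (unnamed character vector).
df$chopped_off_text <- vapply(as.list(trimmed), paste, character(1),
                              collapse = " ", USE.NAMES = FALSE)

print(df)
```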

CodePudding user response：

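You can use the word function from the stringr package to extract "words" from a sentence. str_count(text, "\\s") + 1 counts the number of words in the sentence (the number of whitespace characters plus one), so subtracting wc from it gives the position of the last word to keep.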

library(stringr)
library(dplyr)

df %>% 
  mutate(chopped_off_text = 
           word(text, 1, end = str_count(text, "\\s") + 1 - as.integer(wc)))

                      text wc chopped_off_text
1      this is a long text  1   this is a long
2 this is also a long text  2   this is also a
3        another long text  1     another long
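
If you would rather not load any packages, the same idea can be sketched in base R. This is not from either answer above, just a minimal version under the assumption that words are separated by whitespace. The variable words and the helper logic are illustrative. It splits on runs of whitespace, so repeated spaces do not inflate the word count, and it keeps zero words if wc is larger than the number of words in a row.

```r
df <- data.frame(text = c("this is a long text",
                          "this is also a long text",
                          "another long text"),
                 wc = c('1', '2', '1'))

# One character vector of words per row, split on runs of whitespace.
words <- strsplit(df$text, "\\s+")

# Keep the first (number of words - wc) words of each row,
# never fewer than zero, and paste them back into one string.
df$chopped_off_text <- mapply(
  function(w, n) paste(head(w, max(length(w) - n, 0)), collapse = " "),
  words, as.integer(df$wc),
  USE.NAMES = FALSE
)

print(df)
```

wc is stored as character in the question's df, hence the as.integer() call.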