Getting the last 10 words from a string, applied to a vector of strings

Time:12-19

I have a vector of texts within a dataframe (df1$text), and I am trying to create a new vector with the last 10 words of the text (df1$last.ten). I've tried the following without success:

df1$last.ten = mapply(function(x,y) paste(word(x,y), collapse=" "), df1$text, -1:-10)

But I am getting just one word instead of a string of ten words:

> df1$last.ten[1]
[1] "end."

It works just fine when I feed it a string, so it seems I'm utilizing mapply erroneously.

I've tried to use gsub for this but could not figure out the syntax. Would appreciate a word() or gsub() solution. Thanks!

CodePudding user response:

Here's a base R option -

#example data
df1 <- data.frame(text = c('This is a long text which consists of words more than 10', 
                           'This is another one which is similar to first one but even longer'))

#split string on space for every word and paste the last 10 words in one string
df1$last.ten <- sapply(strsplit(df1$text, '\\s+'), function(x) 
                       paste0(tail(x, 10), collapse = ' '))
df1
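
Since the question also asked for a gsub()-style answer, here is a rough regex sketch of the same idea using sub() with perl = TRUE. The helper name last_words and the pattern are my own illustration, not part of the answer above, and it treats any run of non-whitespace characters as a word.

```r
# Hypothetical helper (not from the original answer): keep the last n
# whitespace-separated words of each string with a single regex.
last_words <- function(x, n = 10) {
  # (?s) lets "." match newlines; ".*?" lazily skips the leading text,
  # and the capture group takes up to n words at the end of the string.
  pattern <- sprintf("(?s)^.*?((\\S+\\s+){0,%d}\\S+)\\s*$", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

df1 <- data.frame(text = c('This is a long text which consists of words more than 10',
                           'Only three words'))
df1$last.ten <- last_words(df1$text)
df1$last.ten
```

Texts with fewer than n words should come back unchanged, because the group is allowed to match fewer words.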

CodePudding user response:

I've made some sample data. Maybe you don't need to use an apply function.


library(stringr)

df1 <- data.frame(text = c("one two three four five six seven eight nine ten eleven",
                           "one two three four five six seven eight nine ten eleven twelve"))

# count the words in each text, then take the word range (count - 9) to count
df1$last.ten <- word(df1[[1]], str_count(df1[[1]], '\\w+') - 9, str_count(df1[[1]], '\\w+'))
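
For what it's worth, word() also accepts negative start and end values that count back from the last word, so the counting step can probably be skipped. This also hints at why the mapply() attempt in the question returned a single word: mapply() pairs each text with one element of -1:-10, so the first row only ever sees word(x, -1). A minimal sketch, assuming stringr is installed and every text has at least ten words (shorter texts are not covered here):

```r
library(stringr)

df1 <- data.frame(text = c("one two three four five six seven eight nine ten eleven",
                           "one two three four five six seven eight nine ten eleven twelve"))

# start = -10 and end = -1 count from the end of each string,
# so this takes the tenth-from-last word through the last word.
df1$last.ten <- word(df1$text, start = -10, end = -1)
df1$last.ten
```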


CodePudding user response:

If this is your data frame (toy data)

df1
                                                            text
1 one two three four five six seven eight nine ten eleven twelve
2 one two three four five six seven eight nine ten eleven twelve
3 one two three four five six seven eight nine ten eleven twelve

then extract the last 10 words like this

rnge <- 10:1

df1$last.ten <- apply( t(apply( as.data.frame(df1$text), 1, function(x)
  rev( unlist( strsplit(x, " ") ) ) )[rnge,]), 1, paste, collapse=" " )

df1
                                                            text
1 one two three four five six seven eight nine ten eleven twelve
2 one two three four five six seven eight nine ten eleven twelve
3 one two three four five six seven eight nine ten eleven twelve
                                                last.ten
1 three four five six seven eight nine ten eleven twelve
2 three four five six seven eight nine ten eleven twelve
3 three four five six seven eight nine ten eleven twelve

Because the words are reversed first, rnge counts positions from the end of the text, so changing it extracts words from anywhere:

rnge <- 5:3

df1$mid <- apply( t(apply( as.data.frame(df1$text), 1, function(x)
  rev( unlist( strsplit(x, " ") ) ) )[rnge,]), 1, paste, collapse=" " )

df1
                                                            text
1 one two three four five six seven eight nine ten eleven twelve
2 one two three four five six seven eight nine ten eleven twelve
3 one two three four five six seven eight nine ten eleven twelve
                                                last.ten            mid
1 three four five six seven eight nine ten eleven twelve eight nine ten
2 three four five six seven eight nine ten eleven twelve eight nine ten
3 three four five six seven eight nine ten eleven twelve eight nine ten
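
One caveat: the nested apply() above relies on every row having the same number of words, because only then does the inner apply() return a matrix that can be indexed with [rnge, ]. If the texts differ in length, something along these lines may be easier to work with. The function name words_from_end is made up for this sketch, and it simply drops positions that run past the start of a short text.

```r
# Hypothetical helper: pick words by their position counted from the end.
# rnge = 10:1 gives the last ten words; rnge = 5:3 gives the 5th- to 3rd-last.
words_from_end <- function(x, rnge) {
  vapply(strsplit(as.character(x), "\\s+"), function(w) {
    picked <- rev(w)[rnge]                         # index into the reversed words
    paste(picked[!is.na(picked)], collapse = " ")  # skip positions that do not exist
  }, character(1))
}

texts <- c("one two three four five six seven eight nine ten eleven twelve",
           "just four words here")
words_from_end(texts, 10:1)
words_from_end(texts, 5:3)
```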