Word count across a subset of columns in one new column

I have the following data frame:

structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
"This is no text"), y = c("What is text?", "Can it eat text?", 
"Maybe I will try.")), class = "data.frame", row.names = c(NA, 
-3L))

I would like to count the words in columns x and y and add them up, so that one new column holds the total number of words per row. It is important that I am able to subset the data. The result should look like this:

structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
"This is no text"), y = c("What is text?", "Can it eat text?", 
"Maybe I will try."), z = c("6", "8", "8")), class = "data.frame", row.names = c(NA, 
-3L))

I have tried using str_count(" ") with different regular expressions in combination with across or apply, but I cannot get it to work.

CodePudding user response：

You can use str_split and then count the length of the result. For simplicity, I added a column xy that has the combined words of x and y:

library(dplyr)
library(stringr)

# combine x and y into one string per row
my_df <- my_df %>% mutate(xy = paste(x, y))
# split each combined string on spaces and count the pieces
my_df$z <- lengths(str_split(my_df$xy, " "))
my_df

CodePudding user response：

Here is a solution using tokenizers:

library(tokenizers)

df <- 
  structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
  "This is no text"), y = c("What is text?", "Can it eat text?", 
  "Maybe I will try.")), class = "data.frame", row.names = c(NA, 
  -3L))

df$z = tokenizers::count_words(df$x) + tokenizers::count_words(df$y)

df
#>   g                 x                 y z
#> 1 1     This is text.     What is text? 6
#> 2 2 This is text too.  Can it eat text? 8
#> 3 3   This is no text Maybe I will try. 8

If you prefer pure R:

df$z <- rowSums(
  sapply(df[, c("x", "y")], function(col)
    # gregexpr returns -1 when a string contains no match, so count 0 in that case
    sapply(gregexpr("\\b\\w+\\b", col), function(m)
      if (m[[1]] > 0) length(m) else 0)))

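Note that \w matches a single word character, so \w+ matches a whole run of them, and \b matches a word boundary. I believe "\\w+" alone suffices, because + is greedy and already consumes the entire run of word characters.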
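
To check that the word boundaries really are redundant, here is a small sketch in base R. The helper name count_matches and the sample strings are made up for illustration, and I pass perl = TRUE so that both patterns go through the same PCRE engine; that flag is my assumption, not part of the code above.

```r
# Sample strings (illustrative only), including an empty one and a contraction
x <- c("This is text.", "Can it eat text?", "", "I don't know")

# Hypothetical helper: number of matches of `pattern` in each element of `text`.
# regmatches() returns character(0) for strings without a match, so those count as 0.
count_matches <- function(pattern, text) {
  lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

with_boundaries    <- count_matches("\\b\\w+\\b", x)
without_boundaries <- count_matches("\\w+", x)

# Both patterns should agree on every string
identical(with_boundaries, without_boundaries)
```

One thing to keep in mind: with either pattern the apostrophe in "don't" splits the word in two, so "I don't know" is counted as four matches.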

CodePudding user response：

One possible solution:

df$z = stringi::stri_count_words(paste(df$x, df$y))

  g                 x                 y z
1 1     This is text.     What is text? 6
2 2 This is text too.  Can it eat text? 8
3 3   This is no text Maybe I will try. 8

Or, in base R:

df$z = lengths(gregexpr("\\b\\w+\\b", paste(df$x, df$y)))
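
Since the question asks for something that still works when the data is subset, the same idea can be wrapped in a small function. This is only a sketch in base R: the function name count_words_in and its arguments are hypothetical, and it counts runs of word characters (\w+), which may differ from what tokenizers or stringi consider a word.

```r
df <- data.frame(
  g = c("1", "2", "3"),
  x = c("This is text.", "This is text too.", "This is no text"),
  y = c("What is text?", "Can it eat text?", "Maybe I will try.")
)

# Hypothetical helper: total word count per row across the chosen columns.
count_words_in <- function(data, cols) {
  # paste the selected columns together row by row
  combined <- do.call(paste, unname(as.list(data[cols])))
  # count the runs of word characters; strings without a match count as 0
  lengths(regmatches(combined, gregexpr("\\w+", combined)))
}

# all rows, both text columns
df$z <- count_words_in(df, c("x", "y"))

# a subset of rows, or a subset of columns
rows_1_and_3 <- count_words_in(df[df$g != "2", ], c("x", "y"))
only_x       <- count_words_in(df, "x")
```

Because the function takes the data frame and the column names as arguments, filtering rows first or passing a different set of columns does not require changing the counting code.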