I have the following data frame:
structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.",
"This is no text"), y = c("What is text?", "Can it eat text?",
"Maybe I will try.")), class = "data.frame", row.names = c(NA,
-3L))
I would like to count the number of words across the columns x and y and sum the values to get one column with the total number of words used per row. It is important that I am able to subset the data. The result should look like this:
structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.",
"This is no text"), y = c("What is text?", "Can it eat text?",
"Maybe I will try."), z = c("6", "8", "8")), class = "data.frame", row.names = c(NA,
-3L))
I have tried using str_count(" ") with different regular expressions in combination with across or apply, but I cannot seem to get a working solution.
CodePudding user response:
You can use str_split and then count the length of the result.
For simplicity, I added a column xy that has the combined words of x and y:
library(dplyr)
library(stringr)
# Combine x and y so that each row holds a single string.
my_df <- my_df %>% mutate(xy = paste(x, y))
# Split each string on whitespace and count the resulting pieces.
z <- lengths(str_split(my_df$xy, "\\s+"))
cbind(my_df, z)
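Since the question asks for a way to restrict the count to a subset of the data, the same split-and-count idea can be sketched in base R. The helper name `count_row_words` and its `cols` argument are hypothetical, and the sketch assumes that a "word" is any run of characters separated by whitespace.

```r
# Example data from the question.
df <- data.frame(
  g = c("1", "2", "3"),
  x = c("This is text.", "This is text too.", "This is no text"),
  y = c("What is text?", "Can it eat text?", "Maybe I will try."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count whitespace-separated words per row
# across the columns named in `cols`.
count_row_words <- function(data, cols) {
  # Paste the selected columns row by row into one string per row.
  # unname() keeps column names from being read as paste() arguments.
  combined <- trimws(do.call(paste, unname(as.list(data[cols]))))
  # Split on runs of whitespace and count the pieces.
  lengths(strsplit(combined, "\\s+"))
}

# All rows, both text columns.
df$z <- count_row_words(df, c("x", "y"))

# A subset: only rows where g is not "2", only column x.
x_only <- count_row_words(df[df$g != "2", ], "x")
```

Passing a different `cols` vector or a filtered data frame changes what is counted without touching the helper itself.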
CodePudding user response:
Here is a solution using tokenizers:
library(tokenizers)
df <-
structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.",
"This is no text"), y = c("What is text?", "Can it eat text?",
"Maybe I will try.")), class = "data.frame", row.names = c(NA,
-3L))
df$z = tokenizers::count_words(df$x) + tokenizers::count_words(df$y)
df
#>   g                 x                 y z
#> 1 1     This is text.     What is text? 6
#> 2 2 This is text too.  Can it eat text? 8
#> 3 3   This is no text Maybe I will try. 8
If you prefer pure R:
df$z <- rowSums(
  sapply(df[, c("x", "y")], function(col)
    sapply(gregexpr("\\b\\w+\\b", col), function(m)
      # gregexpr returns -1 when there is no match at all
      if (m[[1]] > 0) length(m) else 0)))
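Note that \w matches a single word character, so \w+ matches a whole word, and \b matches word boundaries. I believe "\\w+" alone suffices, because the greedy + already extends each match to the end of the word.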
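The claim that the boundaries are optional can be checked with a small comparison. This is only a sketch: the helper `count_matches` is a hypothetical name, and `perl = TRUE` is an assumption added here to pin down the regex engine; the answer above uses the default engine.

```r
# Sample strings, including a contraction to show how \w treats apostrophes.
x <- c("This is text.", "Can it eat text?", "I don't know")

# Hypothetical helper: number of matches of `pattern` in each string.
count_matches <- function(pattern, text) {
  lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

with_boundaries <- count_matches("\\b\\w+\\b", x)
without_boundaries <- count_matches("\\w+", x)

# TRUE when both patterns give the same counts for every string.
same_counts <- identical(with_boundaries, without_boundaries)
```

Both patterns count "don't" as two words, because the apostrophe is not a word character. If contractions should count as one word, splitting on whitespace is the simpler route.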
CodePudding user response:
One possible solution:
df$z = stringi::stri_count_words(paste(df$x, df$y))
  g                 x                 y z
1 1     This is text.     What is text? 6
2 2 This is text too.  Can it eat text? 8
3 3   This is no text Maybe I will try. 8
Or
df$z = lengths(gregexpr("\\b\\w+\\b", paste(df$x, df$y)))
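One caveat with the lengths(gregexpr(...)) approach: for a string with no match, gregexpr returns a single -1, so the length is 1 rather than 0. The sketch below, which uses made-up sample strings, shows the difference and one way around it with regmatches.

```r
# Sample input with an empty string in the middle.
texts <- c("This is text.", "", "Maybe I will try.")

matches <- gregexpr("\\w+", texts)

# Counting match positions directly: the empty string yields 1,
# because "no match" is reported as a single -1.
naive <- lengths(matches)

# Extracting the matched substrings first: no match gives character(0),
# so the empty string is counted as 0 words.
safe <- lengths(regmatches(texts, matches))
```

For the data in the question every row contains text, so both variants give the same result there.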