R Write function to get unigrams in dataframe

Time:12-03

I want to write a function that counts the unigrams (single words) in each row of a text column. However, my current function does not work the way I want it to.
This is my function and an example dataset:

library(ngram)
library(tidyverse)

#dataframe
df<-tribble(~text,
            "This sentence",
            "I am going to luch",
            "This is a really nice and sunny day")

#function
get_unigrams <- function(text) {
  
  unigram<-  ngram(text, n = 1) %>% get.ngrams() %>% length()

  return(unigram)
}

However, calling the function inside mutate() gives me a very strange result:

df %>% mutate(n = get_unigrams(text))

# A tibble: 3 x 2
  text                                    n
  <chr>                               <int>
1 This sentence                          14
2 I am going to luch                     14
3 This is a really nice and sunny day    14

Every row gets the same count. I think this is because mutate() passes the whole column to the function at once, so all three lines of text are put together and treated as one text.
Instead, I would like to have this result:

# A tibble: 3 x 2
  text                                    n
  <chr>                               <int>
1 This sentence                           2
2 I am going to luch                      5
3 This is a really nice and sunny day     8
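The 14 reported for every row is consistent with that reading. A minimal base R sketch (it does not call the ngram package; it only assumes the rows are joined with a space and that distinct words are counted) reproduces the number:

```r
text <- c("This sentence",
          "I am going to luch",
          "This is a really nice and sunny day")

# Pool all rows into one string, as the function appears to do
pooled <- paste(text, collapse = " ")
words  <- strsplit(pooled, " ", fixed = TRUE)[[1]]

n_total    <- length(words)          # 15 words overall
n_distinct <- length(unique(words))  # 14 distinct words ("This" occurs twice)
```

So the function seems to return the number of distinct unigrams in the pooled text, not a per-row word count.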

Can someone help me?
I do not see the error in my function.
Many thanks in advance!

Update:

I have found an (interim) solution:

get_unigrams <- function(text) {
  sapply(text, function(x) {
    ngram(x, n = 1) %>% get.ngrams() %>% length()
  })
}

However, the sapply solution is very slow because it calls ngram() once for every row, and I have a dataframe with more than 100k rows.
Can someone help me increase the speed, for example with a vectorised function?

CodePudding user response:

Use rowwise. Look into ?rowwise for more info.

df %>% rowwise() %>% 
  mutate(n=get_unigrams(text))

  text                                    n
  <chr>                               <int>
1 This sentence                           2
2 I am going to luch                      5
3 This is a really nice and sunny day     8

Another solution (using base R) is:

df$n <- apply(df, 1, get_unigrams)
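
Note that rowwise() and apply() still call the function once per row, so they are unlikely to be much faster than the sapply version. A vectorised base R sketch follows. It assumes that words are separated by whitespace and that the total number of words per row is wanted, not the number of distinct ones. The name n_words is only illustrative.

```r
text <- c("This sentence",
          "I am going to luch",
          "This is a really nice and sunny day")

# Split every string on runs of whitespace in one call,
# then take the length of each resulting piece vector.
# trimws() avoids an empty first piece when a string starts with a space.
n_words <- lengths(strsplit(trimws(text), "\\s+"))
n_words
```

With the example data this gives 2, 5 and 8. In a pipeline the same expression can go inside mutate(), e.g. mutate(n = lengths(strsplit(trimws(text), "\\s+"))).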

CodePudding user response:

Another solution, based on stringr::str_count:

library(tidyverse)

df<-tribble(~text,
            "This sentence",
            "I am going to luch",
            "This is a really nice and sunny day")

df %>% 
  mutate(n = str_count(text, "\\w+"))

#> # A tibble: 3 × 2
#>   text                                    n
#>   <chr>                               <int>
#> 1 This sentence                           2
#> 2 I am going to luch                      5
#> 3 This is a really nice and sunny day     8
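
A regex-based count depends on what the pattern treats as a word. The sketch below uses base R only; count_matches is a hypothetical helper, not part of stringr. It compares two common patterns on a string containing an apostrophe:

```r
# Hypothetical helper: number of regex matches per string
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

x <- c("This sentence", "don't stop", "")

by_word_chars <- count_matches(x, "\\w+")  # runs of letters, digits, underscore
by_non_space  <- count_matches(x, "\\S+")  # runs of non-whitespace

by_word_chars  # 2 3 0: "don't" is split into "don" and "t"
by_non_space   # 2 2 0: "don't" stays one token
```

Which pattern fits depends on whether contractions and hyphenated words should count as one word or several.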