unnest_tokens and keep original columns (tidytext)

Time:11-22

The unnest_tokens() function from the tidytext package is supposed to keep the other columns of the data frame (tibble) you pass to it. In the example provided by the package authors ("tidy_books", on Austen's data) it works fine, but I get some weird behaviour with the data below.

library(dplyr)     # tibble(), %>%, group_by(), mutate()
library(tidytext)  # unnest_tokens()

poem1 <- "Tous les poteaux télégraphiques
Viennent là-bas le long du quai
Sur son sein notre République
A mis ce bouquet de muguet"

poem2 <- "La sottise, l'erreur, le péché, la lésine,
Occupent nos esprits et travaillent nos corps,
Et nous alimentons nos aimables remords,
Comme les mendiants nourrissent leur vermine."

poems <- tibble(n_poem = 1:2, text_poem = c(poem1, poem2))

poems <- poems %>% 
  unnest_tokens(output = lines_poem, input = text_poem, token = "lines")

poems <- poems %>% group_by(n_poem) %>% 
  mutate(n_line = row_number())

This makes me lose all columns:

poems %>% unnest_tokens(output = words_poem, input = lines_poem)

The drop option behaves weirdly and brings back the raw text:

poems %>% unnest_tokens(output = words_poem, input = lines_poem, drop = FALSE)

Answer:

You need to ungroup your data. The documentation for the collapse argument explains that grouping the data makes unnest_tokens() collapse the text within each group, in the same way the collapse argument would:

Grouping data specifies variables to collapse across in the same way as collapse but you cannot use both the collapse argument and grouped data. Collapsing applies mostly to token options of "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".
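
To see why this applies here, it can help to check the grouping state of the tibble directly. The following is only a minimal sketch: lines_tbl and its values are a small hypothetical stand-in for the question's data, while is_grouped_df(), group_vars() and ungroup() are dplyr functions.

```r
library(dplyr)

# Hypothetical stand-in for the line-level tibble from the question
lines_tbl <- tibble(
  n_poem = c(1L, 1L, 2L),
  lines_poem = c("tous les poteaux", "viennent le long", "la sottise")
)

# Same numbering step as in the question: number the lines within each poem
grouped <- lines_tbl %>%
  group_by(n_poem) %>%
  mutate(n_line = row_number())

is_grouped_df(grouped)           # TRUE: mutate() keeps the grouping
group_vars(grouped)              # "n_poem"
is_grouped_df(ungroup(grouped))  # FALSE: grouping removed
```

The grouping set up by group_by() survives the mutate() call, so the tibble that reaches unnest_tokens() is still grouped by n_poem unless ungroup() is called first.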

I'm assuming this is your expected behaviour:

poems %>%
  ungroup() %>%
  unnest_tokens(output = words_poem, input = lines_poem, drop = FALSE)
#> # A tibble: 48 × 4
#>    n_poem lines_poem                      n_line words_poem    
#>     <int> <chr>                            <int> <chr>         
#>  1      1 tous les poteaux télégraphiques      1 tous          
#>  2      1 tous les poteaux télégraphiques      1 les           
#>  3      1 tous les poteaux télégraphiques      1 poteaux       
#>  4      1 tous les poteaux télégraphiques      1 télégraphiques
#>  5      1 viennent là-bas le long du quai      2 viennent      
#>  6      1 viennent là-bas le long du quai      2 là            
#>  7      1 viennent là-bas le long du quai      2 bas           
#>  8      1 viennent là-bas le long du quai      2 le            
#>  9      1 viennent là-bas le long du quai      2 long          
#> 10      1 viennent là-bas le long du quai      2 du            
#> # … with 38 more rows
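
Another option, sketched here under the same assumptions, is to drop the grouping right after numbering the lines, so that every later step works on an ungrouped tibble. The names lines_tbl and words_tbl and the sample text are hypothetical; only the function calls come from dplyr and tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the line-level tibble from the question
lines_tbl <- tibble(
  n_poem = c(1L, 1L, 2L),
  lines_poem = c("tous les poteaux", "viennent le long", "la sottise")
)

words_tbl <- lines_tbl %>%
  group_by(n_poem) %>%
  mutate(n_line = row_number()) %>%  # line number within each poem
  ungroup() %>%                      # remove grouping before tokenizing
  unnest_tokens(output = words_poem, input = lines_poem)

names(words_tbl)
```

With the default drop = TRUE the input column lines_poem is removed, while n_poem and n_line are kept alongside the new words_poem column.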