I've used quanteda to tokenize 10 texts and the result looks like
text 1 [character] word 1, word 2, word 3...
text 2 [character] word 1, word 2, word 3...
...
The class of this object is 'tokens'. I'd like to convert it to a data frame like:
id content
text1 word 1
text1 word 2
text1 word 3
text2 word 1
text2 word 2
...
I've tried
data.frame(id = 1:length(toks), content = unlist(toks))
(where toks is the tokens object), but it doesn't work because the texts have different numbers of tokens, so the columns have different lengths.
Could anyone help? Thank you!
CodePudding user response:
Normally you would go via dfm() and convert() to get where you want to be. Since you didn't give a reproducible example, I will use part of the data_corpus_inaugural corpus that ships with quanteda.
library(quanteda)
library(tidyr)
library(dplyr)
# create tokens
toks <- data_corpus_inaugural %>%
  corpus_subset(Year > 1990) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE)
# convert to data.frame via dfm
out <- convert(dfm(toks), to = "data.frame")
# pivot to get desired outcome
my_df <- out %>%
  pivot_longer(cols = c(!doc_id), names_to = "tokens", values_to = "freq")
my_df
# A tibble: 21,312 × 3
doc_id tokens freq
<chr> <chr> <dbl>
1 1993-Clinton my 7
2 1993-Clinton fellow 5
3 1993-Clinton citizens 2
4 1993-Clinton today 10
5 1993-Clinton we 52
6 1993-Clinton celebrate 3
7 1993-Clinton the 89
8 1993-Clinton mystery 1
9 1993-Clinton of 46
10 1993-Clinton american 4
# … with 21,302 more rows
After this, you can drop the freq column, which contains the frequencies of the words. You also need to filter out the words with a frequency of 0: these are words that appear in other texts but not in this one.
my_df %>%
  filter(freq != 0) %>%
  select(-freq)
Now if you want to get the tokens back in the exact order you tokenized them, you need to do things a bit differently, because dfm() rolls all occurrences of the same word into one: all the "the"s in the first text appear as a single item with a frequency count.
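To see this collapsing concretely, here is a quick illustration (a sketch using a made-up one-sentence document, not part of the data above):

```r
library(quanteda)

# "the" occurs twice in the text, but the dfm keeps a single
# "the" feature whose cell holds the count 2
toy <- tokens(c(d1 = "the cat saw the dog"))
convert(dfm(toy), to = "data.frame")
```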
So to get the tokens in order from the tokens object, you need to do something else. I will take the same tokens object as before, toks, use as.list() to turn it into a named list, and from there use sapply() to pad every text to equal length, so that as_tibble() can build a data frame without the different-row-lengths error you ran into. Afterwards, I remove the tokens with NA values; these are the padding added to make every text the same length.
tok_out <- as.list(toks)
# pad every text with NA to the length of the longest (gives a character matrix)
x <- sapply(tok_out, '[', seq(max(lengths(tok_out))))
my_df_via_toks <- x %>%
  as_tibble() %>%
  pivot_longer(cols = everything(), names_to = "text", values_to = "tokens") %>%
  filter(!is.na(tokens)) %>% # drop the NA padding in each text
  arrange(text)
my_df_via_toks
# A tibble: 15,700 × 2
text tokens
<chr> <chr>
1 1993-Clinton My
2 1993-Clinton fellow
3 1993-Clinton citizens
4 1993-Clinton today
5 1993-Clinton we
6 1993-Clinton celebrate
7 1993-Clinton the
8 1993-Clinton mystery
9 1993-Clinton of
10 1993-Clinton American
# … with 15,690 more rows
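As a side note, the NA-padding round trip can be avoided entirely: a tokens object can be treated as a named list, so rep() over the element lengths produces the id column directly. A minimal sketch, assuming the same toks object as above:

```r
# Repeat each document name once per token, then flatten the token
# list into a single character vector; no padding or NA filtering needed.
tok_list <- as.list(toks)
my_df_direct <- data.frame(
  text   = rep(names(tok_list), lengths(tok_list)),
  tokens = unlist(tok_list, use.names = FALSE)
)
```

This gives one row per token in the original order, which is exactly the shape the question asked for.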