I've used quanteda to tokenize 10 texts and the result looks like
text 1 [character] word 1, word 2, word 3...
text 2 [character] word 1, word 2, word 3...
...
The class of this object is 'tokens'. I'd like to convert it to a data frame like:
id content
text1 word 1
text1 word 2
text1 word 3
text2 word 1
text2 word 2
...
I've tried
data.frame(id = 1:length(toks), content = unlist(toks))
(where toks is the tokens object), but it doesn't work because the texts have different numbers of tokens, so the columns have different lengths.
Could anyone help? Thank you!
CodePudding user response:
Normally you would go via dfm() and convert() to get where you want to be. Since you didn't give a reproducible example, I will use part of the data_corpus_inaugural corpus that ships with quanteda.
library(quanteda)
library(tidyr)
library(dplyr)
# create tokens
toks <- data_corpus_inaugural %>%
  corpus_subset(Year > 1990) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE)
# convert to data.frame via dfm
out <- convert(dfm(toks), to = "data.frame")
# pivot to get desired outcome
my_df <- out %>%
  pivot_longer(cols = c(!doc_id), names_to = "tokens", values_to = "freq")
my_df
# A tibble: 21,312 × 3
doc_id tokens freq
<chr> <chr> <dbl>
1 1993-Clinton my 7
2 1993-Clinton fellow 5
3 1993-Clinton citizens 2
4 1993-Clinton today 10
5 1993-Clinton we 52
6 1993-Clinton celebrate 3
7 1993-Clinton the 89
8 1993-Clinton mystery 1
9 1993-Clinton of 46
10 1993-Clinton american 4
# … with 21,302 more rows
After this, you can drop the freq column, which contains the frequencies of the words. You also need to filter out the words with a frequency of 0: these are words that appear in other texts but not in this one.
my_df %>%
  filter(freq != 0) %>%
  select(-freq)
Now if you want to get the tokens back in the exact order you tokenized them, you need to do things a bit differently, because dfm() rolls all occurrences of the same word into one: all the "the"s in the first text appear as a single item with a frequency count.
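To see this collapsing concretely, here is a quick illustration (a sketch using a made-up one-sentence document, not part of the data above):

```r
library(quanteda)

# "the" occurs twice in the text, but the dfm keeps a single
# "the" feature whose cell holds the count 2
toy <- tokens(c(d1 = "the cat saw the dog"))
convert(dfm(toy), to = "data.frame")
```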
So to get the tokens in order from the tokens object, you need to do something else. I will take the same tokens object as before, toks, use as.list() to turn it into a named list, and from there use sapply() to pad every text to equal length, so that as_tibble() can build a data frame without the different-row-lengths error you ran into. Afterwards, I remove the tokens with NA values; these are the padding added to make every text the same length.
tok_out <- as.list(toks)
# pad every text with NA to the length of the longest (gives a character matrix)
x <- sapply(tok_out, '[', seq(max(lengths(tok_out))))
my_df_via_toks <- x %>%
  as_tibble() %>%
  pivot_longer(cols = everything(), names_to = "text", values_to = "tokens") %>%
  filter(!is.na(tokens)) %>% # drop the NA padding in each text
  arrange(text)
my_df_via_toks
# A tibble: 15,700 × 2
text tokens
<chr> <chr>
1 1993-Clinton My
2 1993-Clinton fellow
3 1993-Clinton citizens
4 1993-Clinton today
5 1993-Clinton we
6 1993-Clinton celebrate
7 1993-Clinton the
8 1993-Clinton mystery
9 1993-Clinton of
10 1993-Clinton American
# … with 15,690 more rows
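As a side note, the NA-padding round trip can be avoided entirely: a tokens object can be treated as a named list, so rep() over the element lengths produces the id column directly. A minimal sketch, assuming the same toks object as above:

```r
# Repeat each document name once per token, then flatten the token
# list into a single character vector; no padding or NA filtering needed.
tok_list <- as.list(toks)
my_df_direct <- data.frame(
  text   = rep(names(tok_list), lengths(tok_list)),
  tokens = unlist(tok_list, use.names = FALSE)
)
```

This gives one row per token in the original order, which is exactly the shape the question asked for.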