I am in the early stages of building and testing a custom dictionary. I am testing it on a set of American state party platforms (a corpus of 30 .txt files). I have successfully created the dictionary and used quanteda to produce summary statistics, but it only seems to display 6 files at a time, and my plan is to use the dictionary on hundreds of files going back decades. Is there a way to display more than 6 documents at a time?
Here is the code I used, followed by the output, which shows only 6 of the files and ends with a truncation message:
corp_platform <- corpus(corp)
toks_platform <- tokens(corp_platform)
dict_toks <- tokens_lookup(toks_platform, dictionary = dict)
print(dict_toks)
dfm(dict_toks)
Document-feature matrix of: 30 documents, 2 features (1.67% sparse) and 2 docvars.
commmunitarian individualist
akdem20.txt 113 20
azdem20.txt 60 13
cadem20.txt 254 98
medem20.txt 27 7
mndfl20.txt 40 18
ncdem20.txt 235 64
[ reached max_ndoc ... 24 more documents ]
CodePudding user response:
The print methods for core quanteda objects, such as dfm objects, print only a limited number of rows by default. That's what you are seeing here, and why the output states:
Document-feature matrix of: 30 documents
[...]
.. 24 more documents ]
It's telling you that all 30 documents are there.
This is all documented; see help("print-methods", package = "quanteda"). If you want summary statistics, try quanteda.textstats::textstat_frequency(). Or if you want the dfm as a data.frame, use convert(dfm(dict_toks), to = "data.frame").
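As a rough sketch of the options mentioned above, the following uses quanteda's built-in data_corpus_inaugural as a stand-in for the platform files, together with a made-up dictionary; the object names (dict_example, dfmat, df) and the dictionary patterns are assumptions for illustration, not the original poster's data. It shows three ways to see every document: raising max_ndoc for a single print call, raising the session-wide default, and converting to a data.frame.

```r
library(quanteda)

# Hypothetical dictionary; keys and patterns are placeholders only
dict_example <- dictionary(list(
  communitarian = c("communit*", "together", "we"),
  individualist = c("individual*", "liberty", "i")
))

# Built-in corpus used here in place of the 30 platform files
toks_example <- tokens(data_corpus_inaugural)
dfmat <- dfm(tokens_lookup(toks_example, dictionary = dict_example))

# Option 1: print all documents for this call only
print(dfmat, max_ndoc = ndoc(dfmat))

# Option 2: raise the default number of dfm rows printed in this session
quanteda_options(print_dfm_max_ndoc = 100)

# Option 3: convert to a data.frame, which has one row per document
df <- convert(dfmat, to = "data.frame")
nrow(df) == ndoc(dfmat)
```

Converting to a data.frame is probably the most convenient route once there are hundreds of files, since the result can be filtered, sorted, or written out with write.csv().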
CodePudding user response:
Thanks very much. I just needed a way to display the output and could not find an example, so this is very helpful. I changed my code to:
corp_platform <- corpus(corp)
toks_platform <- quanteda::tokens(corp_platform)
dict_toks <- tokens_lookup(toks_platform, dictionary = dict)
print(dict_toks)
dfm(dict_toks)
convert(dfm(dict_toks), to = "data.frame")
and the output is:
doc_id commmunitarian individualist
1 akdem20.txt 113 20
2 azdem20.txt 60 13
3 cadem20.txt 254 98
4 medem20.txt 27 7
5 mndfl20.txt 40 18
.........................................
25 tx2022draft.txt 198 156
26 txgop20.txt 181 153
27 wagop20.txt 52 63
28 wigop20.txt 27 11
29 wvgop20.txt 72 47
30 wygop20.txt 22 21