Quanteda - display full output; error message: "reached max_ndoc ... 24 more documents"

I am in the early stages of building and testing my own dictionary. I am testing it on a set of American state party platforms (a corpus of 30 txt files). I have successfully created the dictionary and used Quanteda to produce summary statistics, but it only seems to show them for 6 files at a time, and my plan is to use the dictionary on hundreds of files going back decades. Is there a way to display more than 6 documents at a time?

Here is the code I used, which produced output for only 6 files, followed by the error message:

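library(quanteda)

# corp holds the 30 platform texts
corp_platform <- corpus(corp)
toks_platform <- tokens(corp_platform)

# Replace each matched token with its dictionary key
dict_toks <- tokens_lookup(toks_platform, dictionary = dict)
print(dict_toks)

# Count dictionary keys per document
dfm(dict_toks)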

Document-feature matrix of: 30 documents, 2 features (1.67% sparse) and 2 docvars.
            commmunitarian individualist
akdem20.txt            113            20
azdem20.txt             60            13
cadem20.txt            254            98
medem20.txt             27             7
mndfl20.txt             40            18
ncdem20.txt            235            64

[ reached max_ndoc ... 24 more documents ]

CodePudding user response：

The print methods for core objects, such as dfm objects, by default only print a specified number of rows. That's what you are seeing here, and why it states:

Document-feature matrix of: 30 documents

[...]

[ reached max_ndoc ... 24 more documents ]

It's not an error. It's telling you that all 30 documents are in the dfm, and that only the first 6 are being printed.

This is all documented. See help("print-methods", package = "quanteda"). If you want summary statistics, try quanteda.textstats::textstat_frequency(). Or if you want the dfm as a data.frame, use convert(dfm(dict_toks), to = "data.frame").
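If you only want the printed dfm to show more rows, here is a minimal sketch of two ways to raise the limit. The texts and dictionary are toy stand-ins, since the original corp and dict are not shown in the question. The max_ndoc and max_nfeat arguments and the print_dfm_max_ndoc option are, as far as I know, the ones described on that help page.

```r
library(quanteda)

# Toy stand-ins for the asker's `corp` and `dict` (assumed, not from the question)
txt <- paste("community family liberty", seq_len(8))
names(txt) <- paste0("doc", seq_len(8), ".txt")
dict <- dictionary(list(communitarian = c("community", "family"),
                        individualist = c("liberty", "freedom")))

dfmat <- dfm(tokens_lookup(tokens(corpus(txt)), dictionary = dict))

# Option 1: raise the limits for a single print call
print(dfmat, max_ndoc = ndoc(dfmat), max_nfeat = nfeat(dfmat))

# Option 2: raise the session-wide default used when a dfm is printed
quanteda_options(print_dfm_max_ndoc = 100)
dfmat
```

Option 1 is handy for a one-off look. Option 2 changes every later dfm printout in the session, which may be more convenient once the corpus grows to hundreds of files.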

CodePudding user response：

Thanks very much. I just needed a way to display the full output and could not find an example, so this is very helpful. I changed my code to:

corp_platform <- corpus(corp)
toks_platform <- quanteda::tokens(corp_platform)
dict_toks <- tokens_lookup(toks_platform, dictionary = dict)
print(dict_toks)
dfm(dict_toks)
convert(dfm(dict_toks), to = "data.frame")

and the output is:

   doc_id          commmunitarian individualist
1  akdem20.txt                113            20
2  azdem20.txt                 60            13
3  cadem20.txt                254            98
4  medem20.txt                 27             7
5  mndfl20.txt                 40            18
   .........................................
25 tx2022draft.txt            198           156
26 txgop20.txt                181           153
27 wagop20.txt                 52            63
28 wigop20.txt                 27            11
29 wvgop20.txt                 72            47
30 wygop20.txt                 22            21