Type-Token Ratio in Google Sheets: How to manipulate long strings of text (millions of characters)-CodePudding

Here’s the challenge. In a Google Sheets spreadsheet, I have a column in which can be found a range of cells containing lists of words separated by comas, one per row, up to a thousand row. Each list show the words taken from a text, in alpha-numeral order, from a few hundred to a few thousand words. I need to count both the total of words in all the rows, taken together, and the number of unique word forms too. In other words, from the glossary of natural language processing, I want to know the number of tokens and the number of types in my corpus, in order to calculate the type-token ratio or lexical density.

In particular, finding the number of unique word forms in the whole column have proven to be a challenge. In an ARRAY FORMULA, with corresponding functions, I’ve JOINED the strings, SPLITED the words, TRANSPOSED them, then removed duplicates with UNIQUE function, then counted the remaining word forms. This worked on a sample corpus constituted of a little over ten lists of words, but failed when I reached fifteen or so lists of words taken together, a far cry from the thousand lists I need to join in my formula to obtain the results I am looking for.

From what I can gather, the problem would reside in that the resulting string I intend to manipulate is exceeding 50,000 characters. Here and there, for specific cases, I’ve found similar questions, and propositions for workarounds, mostly through custom functions, but I could not replicate the result. Needless to say, writing custom fonctions on my own is beyond my reach. Someone suggests to use QUERY headers, but I did not figured either if this was of any help in my case.

The formulas I came up with are the following:

To obtain the total number of words (tokens) through all the lists: =COUNTA(ARRAYFORMULA(SPLIT(JOIN(",";1;B2:B);",")))

To obtain the number of unique word forms (types) through all the lists: =COUNTA(ARRAYFORMULA(UNIQUE(TRANSPOSE(SPLIT(JOIN(",";1;B2:B);",")))))

A sample in a spreadsheet can be found here.

EDIT 1:

I’ve included the column of texts stripped of ponctuation, from which the lists of words are generated, and the formula used to generate them.

EDIT 2:

Changed the title to better reflect the general intent.

CodePudding user response：

For total items, try:

=arrayformula(query(flatten(iferror(split(B2:B;",";1);));"select count(Col1) where Col1 !='' label count(Col1) '' ";0))

For total unique items:

=arrayformula(query(unique(flatten(iferror(split(B2:B;",";1);)));"select count(Col1) where Col1 !='' label count(Col1) '' ";0))

You might get problems if you have too many rows in the sheet. If so, set the range limit to something like B2:B1000

Add this to cell C1 to get a list of 'Comma separated items':

=arrayformula({"Comma separated items";if(B2:B<>"";len(regexreplace(B2:B;"[^\,]";)) 1;)})

Explanation:

The arrayformula() allows the calculation to cascade down the sheet, from one cell.

So within the arrayformula(), the starting point is the split(B2:B;",") to create columns for each of the comma separated items.

The iferror(split(B2:B;",");"") leaves a blank where cells don't have a comma (like those from row 32). Instead of ;"") shown above I usually just use ;), removing "" so nothing is the result of the iferror.

Then flatten() takes all of the columns and flattens them into a single column.

query() is needed to count the resulting column count(Col1) where no cells are empty where Col1 !='', and the label count(Col1) '' removea a label 'count' which would usually be displayed.

For the list of unique values, unique() is placed before thequery(), after the flatten().