Count unique words with collections and dataframe

I want to count the unique words in a DataFrame column, but my function only counts the words in the first sentence.

                          text
0  hello is a unique sentences
1         hello this is a test
2              does this works

import pandas as pd
d = {
    "text": ["hello is a unique sentences",
             "hello this is a test", 
             "does this works"],
}
df = pd.DataFrame(data=d)


from collections import Counter

# Count unique words
def counter_word(text_col):
    print(len(text_col.values))
    count = Counter()
    for i, text in enumerate(text_col.values):
        print(i)
        for word in text.split():
            count[word] += 1
        return count

counter = counter_word(df['text'])
len(counter)

CodePudding user response：

I think it is simpler to join the values with a space, then split the result into words and count them:

counter = Counter((' '.join(df['text'])).split())

print(counter)
Counter({'hello': 2, 'is': 2, 'a': 2, 'this': 2, 'unique': 1, 'sentences': 1, 'test': 1, 'does': 1, 'works': 1})
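
As for why the function in the question stops after the first sentence: its return statement is indented inside the outer for loop, so the function exits during the first iteration. A minimal sketch of the same loop with the return moved out of the loop follows. The list `texts` is only a stand-in for df['text'] (an assumption made to keep the example free of dependencies); iterating over the Series yields the same strings.

```python
from collections import Counter

def counter_word(texts):
    """Count word occurrences across an iterable of strings."""
    count = Counter()
    for text in texts:
        for word in text.split():
            count[word] += 1
    # Return only after every row has been processed,
    # not from inside the loop.
    return count

# Stand-in for df['text'] from the question.
texts = ["hello is a unique sentences",
         "hello this is a test",
         "does this works"]

counter = counter_word(texts)
print(len(counter))  # 9 unique words
```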

CodePudding user response：

You can use itertools.chain to build a lazy iterator over all the words and feed it to Counter:

from itertools import chain
counter = Counter(chain.from_iterable(map(str.split, df['text'])))

output:

Counter({'hello': 2,
         'is': 2,
         'a': 2,
         'unique': 1,
         'sentences': 1,
         'this': 2,
         'test': 1,
         'does': 1,
         'works': 1})
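
Once the Counter is built, the number of unique words asked for in the question is simply its length. A small sketch, again using a plain list named `texts` as a stand-in for df['text'] (that name is an assumption for illustration):

```python
from collections import Counter
from itertools import chain

# Stand-in for df['text'] from the question.
texts = ["hello is a unique sentences",
         "hello this is a test",
         "does this works"]

# chain.from_iterable flattens the per-row word lists lazily.
counter = Counter(chain.from_iterable(map(str.split, texts)))

n_unique = len(counter)        # number of distinct words
n_total = sum(counter.values())  # number of words overall

print(n_unique)  # 9
print(n_total)   # 13
```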

CodePudding user response：

It may be easier and more efficient to stack the words into a single column and then count them with pandas' value_counts instead of Counter:

df["text"].str.split(expand=True).stack().value_counts()
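
A possible variant of the same idea, sketched here as an alternative rather than a replacement: Series.explode (available since pandas 0.25) gives every word its own row without first building a wide frame padded with missing values, and nunique returns the unique-word count directly.

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello is a unique sentences",
                            "hello this is a test",
                            "does this works"]})

# Split each row into a list of words, then put each word on its own row.
words = df["text"].str.split().explode()

word_counts = words.value_counts()  # occurrences per word, most frequent first
n_unique = words.nunique()          # number of distinct words

print(word_counts)
print(n_unique)  # 9
```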