I have a list of unique tokens, `unique_words`, and a dataset column that contains text, `dataset['text']`. I want to count how many times each element of `unique_words` appears in my entire text data and display the k most common of those words.
unique_words = ['ab', 'bc', 'cd', 'de']
| id | text |
|----|------|
| 1x | ab cd de th sk gl wlqm dhwka oqbdm |
| p2 | de de de lm eh nfkie qhas hof |
3 most common words:
'de', 100
'ab', 11
'cd', 5
CodePudding user response:
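All three methods below assume `dataset` is a pandas DataFrame with the `text` column from the question. As a minimal, assumed setup for reproducing the sample (this construction is not part of any of the methods themselves):

import pandas as pd

# Sample data from the question (assumed reconstruction)
dataset = pd.DataFrame({
    'id': ['1x', 'p2'],
    'text': ['ab cd de th sk gl wlqm dhwka oqbdm',
             'de de de lm eh nfkie qhas hof'],
})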
Method 1: Using pandas
This method uses vectorized `str` methods to:

- split each string into tokens
- expand them and stack them into a single Series
- use `value_counts()` to get frequency counts
- filter the index based on `unique_words`
- fetch the top k using `.head()`, since `value_counts()` already sorts counts in descending order
import pandas as pd

unique_words = ['ab', 'cd', 'de', 'bc']

# Split each row into tokens, stack them into one Series, and count every token
counts = dataset['text'].str.split(expand=True).stack().value_counts()

# Keep only the tokens that appear in unique_words and take the 3 most frequent
top3 = counts[counts.index.isin(unique_words)].head(3)
top3
de 4
ab 1
cd 1
dtype: int64
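As a side note beyond the original answer: on pandas 0.25+ the same counts can be computed with `Series.explode()`, which avoids building the wide intermediate frame that `expand=True` creates:

# Equivalent version using explode(), assuming the same `dataset` as above
counts = dataset['text'].str.split().explode().value_counts()
top3 = counts[counts.index.isin(unique_words)].head(3)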
Method 2: Using sklearn's CountVectorizer
You can use `CountVectorizer()` from sklearn to get token frequencies for the `unique_words` by setting them as your vocabulary. Here is a code example for the sample dataset you have updated in your question.

- Initialize a `CountVectorizer` with the vocabulary set to `unique_words`, using `CountVectorizer(vocabulary=unique_words)`
- Fit and transform the sentences in the `text` column with this vectorizer using `cnt.fit_transform(dataset['text'])` (call `.toarray()` on the result if you want a dense array)
- Take the sum of the occurrences of each vocabulary word across the sentences with `mat.sum(axis=0)`
- Finally, save it as a Series and use `.nlargest(3)` to get the top k keywords by frequency of occurrence across the dataset
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

unique_words = ['ab', 'cd', 'de', 'bc']

# Restrict counting to unique_words by fixing the vectorizer's vocabulary
cnt = CountVectorizer(vocabulary=unique_words)
mat = cnt.fit_transform(dataset['text'])      # sparse (rows, vocab) count matrix

# Sum the counts over all rows to get one total per vocabulary word
tot = np.asarray(mat.sum(axis=0)).ravel()

# Wrap as a Series indexed by the vocabulary and take the 3 largest counts
top3 = pd.Series(tot, index=unique_words).nlargest(3)
top3
de 4
ab 1
cd 1
dtype: int64
Read more about sklearn's CountVectorizer in the scikit-learn documentation.
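If you would rather not rely on `unique_words` supplying the column order, the summed counts can be labelled with the vectorizer's own feature names instead; a small sketch, assuming scikit-learn >= 1.0 for `get_feature_names_out()`:

# Index the totals by the vectorizer's column order instead of unique_words
top3 = pd.Series(np.asarray(mat.sum(axis=0)).ravel(),
                 index=cnt.get_feature_names_out()).nlargest(3)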
Method 3: Using collections.Counter
- First, convert the Series of sentences to a list using `.tolist()`
- Next, map `str.split` over this list to break the sentences into tokens, resulting in a list of lists
- Next, use `itertools.chain` to merge these lists into a single chain object
- Then use `Counter` to get word counts for all tokens in this chain object
- Then use a dict comprehension to get only those tokens that are in your `unique_words` list, and convert it back to a `Counter`
- Finally, use `.most_common(3)` to get the top k keys based on frequency
from collections import Counter
from itertools import chain

# Split every sentence into tokens, chain them into one iterable, and count all tokens
counter = Counter(chain.from_iterable(map(str.split, dataset['text'].tolist())))

# Keep only the counts for the words of interest (0 if a word never appears)
filtered = Counter({word: counter.get(word, 0) for word in unique_words})

top3 = filtered.most_common(3)
top3
[('de', 4), ('ab', 1), ('cd', 1)]
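If you want k as an explicit parameter, any of the approaches above can be wrapped in a small helper. A sketch based on the Counter approach (the function name `top_k_words` is made up here, not part of the original answer):

from collections import Counter
from itertools import chain

def top_k_words(texts, vocabulary, k=3):
    """Return the k most common words from vocabulary across an iterable of strings."""
    counts = Counter(chain.from_iterable(s.split() for s in texts))
    return Counter({w: counts.get(w, 0) for w in vocabulary}).most_common(k)

top_k_words(dataset['text'], unique_words, k=3)
# [('de', 4), ('ab', 1), ('cd', 1)]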