compute n-grams by category column with pandas

Time:02-22

I'm trying to find the most frequent n-grams in a pandas column in Python. I managed to put together the following code, which does exactly that.

However, I would like the results split by the "category" column. Instead of a line with bi-gram|total frequency like

"blue orange"|1

I would like three columns, bi-gram|frequency fruit|frequency meat, like

"blue orange"|1|0

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = {'text':['blue orange is tired', 'an apple', 'meat are great for my stomach'],
        'category':['fruit', 'fruit', 'meat']}
df = pd.DataFrame(data)

word_vectorizer = CountVectorizer(ngram_range = (2, 3), analyzer = 'word')
sparse_matrix = word_vectorizer.fit_transform(df['text'])
frequencies = sum(sparse_matrix).toarray()[0]
df_ngrams = pd.DataFrame(frequencies, index = word_vectorizer.get_feature_names_out(), columns = ['frequency'])
df_ngrams.sort_values('frequency', ascending = False).head(50)

Answer:

If you refactor your code into a function, you can apply it per group:

def compute_ngram_freq(df):
    word_vectorizer = CountVectorizer(ngram_range = (2, 3), analyzer = 'word')
    sparse_matrix = word_vectorizer.fit_transform(df['text'])
    frequencies = sum(sparse_matrix).toarray()[0]
    df_ngrams = pd.DataFrame(frequencies, index = word_vectorizer.get_feature_names_out(), columns = ['frequency'])
    return df_ngrams.sort_values('frequency', ascending = False)


out = df.groupby('category').apply(compute_ngram_freq).unstack(level=0, fill_value=0)

output:

                frequency     
category            fruit meat
an apple                1    0
are great               0    1
are great for           0    1
blue orange             1    0
blue orange is          1    0
for my                  0    1
for my stomach          0    1
great for               0    1
great for my            0    1
is tired                1    0
meat are                0    1
meat are great          0    1
my stomach              0    1
orange is               1    0
orange is tired         1    0
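
To make it clearer what the per-category table contains, here is a minimal standard-library sketch that builds the same counts without scikit-learn. It is not the approach above, only an illustration. The helper names `ngrams` and `ngram_counts_by_category` are made up for this example, and the regular expression is an assumption meant to mirror CountVectorizer's default tokenization (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter, defaultdict

# Assumption: mimic CountVectorizer's defaults (lowercase, tokens of 2+ word characters).
TOKEN_RE = re.compile(r"\b\w\w+\b")

def ngrams(text, n_min=2, n_max=3):
    """Yield every word n-gram of length n_min..n_max found in one text (hypothetical helper)."""
    tokens = TOKEN_RE.findall(text.lower())
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def ngram_counts_by_category(texts, categories):
    """Return {category: Counter({ngram: frequency})} (hypothetical helper)."""
    counts = defaultdict(Counter)
    for text, category in zip(texts, categories):
        counts[category].update(ngrams(text))
    return counts

texts = ['blue orange is tired', 'an apple', 'meat are great for my stomach']
categories = ['fruit', 'fruit', 'meat']
counts = ngram_counts_by_category(texts, categories)

# One row per n-gram, one column per category; a missing n-gram counts as 0.
all_ngrams = sorted(set().union(*counts.values()))
for gram in all_ngrams:
    print(gram, counts['fruit'][gram], counts['meat'][gram], sep='|')
```

Each printed line has the bi-gram|frequency fruit|frequency meat shape asked for in the question. For larger data the CountVectorizer version is probably the better choice, since it keeps the counts in a sparse matrix.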