Find the most popular word order in a Pandas dataframe

Time:09-27

I'm trying to find the most common word order in a pandas dataframe for strings which occur more than once.

Example Dataframe

                                         title
0                             Men's Nike Socks
1                             Nike Socks Men's
2                       Men's Black Nike Socks
3                             Men's Nike Socks
4  Everyday 3 Pack Cotton Cushioned Crew Socks

Desired Output

Men's Nike Socks

This is because each of these words occurs more than once, and this is the order in which they most commonly appear.

What I've Tried

I thought one way to tackle this would be to assign a score to each word position, e.g. first position = high score, later position (further right in the sentence) = lower score.

I considered counting the maximum number of words which appear in the dataframe and then using that to incrementally score the words based on their frequency and position.
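A minimal sketch of that scoring idea, under my own assumptions: keep only words that occur at least twice across all titles, then order them by their average position. The function name `common_word_order` and the `min_count` parameter are made up for illustration, and averaging positions is just one possible way to score them.

```python
from collections import Counter, defaultdict

import pandas as pd

titles = pd.Series([
    "Men's Nike Socks for sale",
    "Nike Socks Men's",
    "Men's Nike Socks in the UK",
    "Men's Nike Socks to buy",
    "Everyday 3 Pack Cotton Cushioned Crew Socks",
])


def common_word_order(titles, min_count=2):
    """Join the words seen at least min_count times, ordered by average position."""
    counts = Counter()
    positions = defaultdict(list)
    for title in titles:
        for pos, word in enumerate(title.split()):
            counts[word] += 1
            positions[word].append(pos)
    # Drop words that appear fewer than min_count times overall.
    kept = [word for word, count in counts.items() if count >= min_count]
    # A lower average position means the word tends to sit further left.
    kept.sort(key=lambda word: sum(positions[word]) / len(positions[word]))
    return " ".join(kept)


print(common_word_order(titles))
```

On this data only "Men's", "Nike" and "Socks" occur more than once, so those are the words that get ordered.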

I'm a Python beginner and not sure how to progress further than that.

It's worth mentioning that the word sizes will be random, and not constrained to the example above.

Minimum Reproducible Example

import pandas as pd

data = [
    "Men's Nike Socks for sale",
    "Nike Socks Men's",
    "Men's Nike Socks in the UK",
    "Men's Nike Socks to buy",
    "Everyday 3 Pack Cotton Cushioned Crew Socks",
]

df = pd.DataFrame(data, columns=['title'])

print(df)

Edit: My original example was too simplified, as my desired output appeared exactly twice in the dataframe.

I've updated the dataframe, but the desired output is still the same.

CodePudding user response:

Use value_counts() and idxmax().

result = df['title'].value_counts().idxmax()
print(result)

Output: Men's Nike Socks

Explanation:

>>> df['title'].value_counts()

Men's Nike Socks                               2
Nike Socks Men's                               1
Men's Black Nike Socks                         1
Everyday 3 Pack Cotton Cushioned Crew Socks    1
Name: title, dtype: int64

Update based on the new DataFrame:

max_split = df['title'].str.split().apply(len).max()
for i in range(1, max_split):
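    try:
        # Keep the first i words of each title and take the most common prefix
        # (n must be passed by keyword in pandas 2.x)
        result = df['title'].str.split(' ', n=i, expand=True).iloc[:, :-1].apply(' '.join, axis=1).mode()[0]
    except TypeError:
        # A title with fewer than i words yields None, which ' '.join cannot handle
        break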
print(result)

Output: Men's Nike Socks

CodePudding user response:

You can use pandas.Series.mode, which returns the most frequent value(s) in a column/Series:

out = df['title'].mode()

# Output :

print(out)

0    Men's Nike Socks
Name: title, dtype: object
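
Note that mode() returns every value tied for the highest count. A small check, reusing the updated data from the question (where no title repeats exactly), shows why mode() alone is not enough there; the variable names are only illustrative:

```python
import pandas as pd

titles = pd.Series([
    "Men's Nike Socks for sale",
    "Nike Socks Men's",
    "Men's Nike Socks in the UK",
    "Men's Nike Socks to buy",
    "Everyday 3 Pack Cotton Cushioned Crew Socks",
], name="title")

# Every title occurs exactly once, so all of them tie for "most frequent".
modes = titles.mode()
print(len(modes))
print(modes.tolist())
```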

# Edit :

To find the most frequent phrases in a column, use nltk as shown in the code below (heavily inspired by @jezarel):

from nltk import ngrams

# Flatten every title into one word list (n-grams may therefore span two adjacent titles)
vals = [y for x in df['title'] for y in x.split()]
n = [3, 4, 5]  # phrase lengths of 3 to 5 words; adjust as needed

out = pd.Series([' '.join(y) for x in n for y in ngrams(vals, x)]).value_counts().idxmax()


print(out)

# Output :
Men's Nike Socks
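
If you would rather not depend on nltk, or want phrases that never cross from one title into the next, the same idea can be sketched with collections.Counter by building the n-grams title by title. The helper name `most_common_phrase` and its `sizes` parameter are hypothetical, and the phrase lengths still need adjusting to your data:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"title": [
    "Men's Nike Socks for sale",
    "Nike Socks Men's",
    "Men's Nike Socks in the UK",
    "Men's Nike Socks to buy",
    "Everyday 3 Pack Cotton Cushioned Crew Socks",
]})


def most_common_phrase(titles, sizes=(3, 4, 5)):
    """Return (phrase, count) for the most frequent n-gram, or None if there is none."""
    counts = Counter()
    for title in titles:
        words = title.split()
        for n in sizes:
            # Slide a window of n words over this title only.
            for start in range(len(words) - n + 1):
                counts[" ".join(words[start:start + n])] += 1
    if not counts:
        return None
    return counts.most_common(1)[0]


print(most_common_phrase(df["title"]))
```

Ties between equally frequent phrases are not handled specially here, so check the counts if several phrases could share the top spot.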