Keeping only trigrams whose words are all unique (Python, pandas)

I have a pandas DataFrame in which one column contains lists of trigrams. The DataFrame is huge.

df:

val    Trigrams
 1     ['status opportunity reference', 'reject remove reject', 'situation situation hon', 'Good Sure Cable']
 2     ['don bradman sir', 'manners maketh man', 'gold chain chain']

Below is a sample value from one row of the Trigrams column:

['status opportunity reference', 'reject remove reject', 'situation situation hon', 'Good Sure Cable']

In the 2nd element, "reject remove reject", the first word is equal to the 3rd word.

In the 3rd element, "situation situation hon", the first word is equal to the 2nd word.

I want to remove such trigrams from the list and keep only those trigrams in which all 3 words are unique.

In this case the output would be:

['status opportunity reference', 'Good Sure Cable']

I have written a custom for loop:

new_list = []
for trigram in list_of_trigrams:
    # Split into words first; indexing the string itself would compare characters.
    words = trigram.split()
    if words[0] != words[1] and words[1] != words[2] and words[0] != words[2]:
        new_list.append(trigram)

By wrapping this logic in a function, I can use pandas' .apply to get the output. But that doesn't seem like the right approach, since I have to do this for millions of rows.

What I'm looking for is a pythonic way of doing this, quickly!

CodePudding user response：

Use:

import pandas as pd

df = pd.DataFrame({'val': [1, 2],
                   'Trigrams': [['status opportunity reference', 'reject remove reject', 'situation situation hon', 'Good Sure Cable'],
                                ['don bradman sir', 'manners maketh man', 'gold chain chain']]})

df['Trigrams'] = [[y for y in x if len(set(y.split())) == 3] for x in df['Trigrams']]

Alternatively, with .apply:

df['Trigrams'] = df['Trigrams'].apply(lambda x: [y for y in x if len(set(y.split())) == 3])

print(df)
   val                                         Trigrams
0    1  [status opportunity reference, Good Sure Cable]
1    2            [don bradman sir, manners maketh man]
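One caveat about the check above: `len(set(y.split())) == 3` assumes every entry has exactly three words. An entry with four words and one repeat (for example 'a b c c') would also pass, and the comparison is case-sensitive. The following is a minimal sketch of a slightly more general check, assuming you want "no repeated word" regardless of length. The names `all_words_unique`, `filter_unique` and the `case_sensitive` parameter are hypothetical, introduced only for illustration.

```python
import pandas as pd


def all_words_unique(ngram, case_sensitive=True):
    """Return True when no word in the n-gram occurs more than once."""
    words = ngram.split()
    if not case_sensitive:
        # Assumption: 'Good' and 'good' should count as the same word.
        words = [w.lower() for w in words]
    return len(words) == len(set(words))


def filter_unique(ngrams, case_sensitive=True):
    """Keep only the n-grams whose words are all distinct."""
    return [g for g in ngrams if all_words_unique(g, case_sensitive)]


df = pd.DataFrame({
    'val': [1, 2],
    'Trigrams': [
        ['status opportunity reference', 'reject remove reject',
         'situation situation hon', 'Good Sure Cable'],
        ['don bradman sir', 'manners maketh man', 'gold chain chain'],
    ],
})

# Same list-comprehension pattern as above, with the helper doing the check.
df['Trigrams'] = [filter_unique(x) for x in df['Trigrams']]
```

For the sample data this keeps the same trigrams as the one-liner; the two only differ on entries that do not have exactly three words, or when `case_sensitive=False` is passed.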

Performance for 200k rows:

df = pd.DataFrame({'val':[1,2],
                   'Trigrams': [['status opportunity reference', 'reject remove reject', 'situation situation hon', 'Good Sure Cable'],
                                ['don bradman sir', 'manners maketh man', 'gold chain chain']]}, 
                  )
df = pd.concat([df] * 100000, ignore_index=True)


df['Trigrams1'] = [[y for y in x if len(set(y.split())) == 3] for x in df['Trigrams']]
df['Trigrams2'] = df['Trigrams'].apply(lambda x: [y for y in x if len(set(y.split())) == 3])



In [96]: %timeit df['Trigrams1'] = [[y for y in x if len(set(y.split())) == 3] for x in df['Trigrams']]
627 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [97]: %timeit df['Trigrams2'] = df['Trigrams'].apply(lambda x: [y for y in x if len(set(y.split())) == 3])
520 ms ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
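If the same trigram strings occur many times across rows, caching the per-trigram result might avoid repeating the split and set construction. This is only a sketch under that assumption and has not been timed here, so measure it on your own data before relying on it. The helper name `has_unique_words` and the column name `Trigrams3` are hypothetical.

```python
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=None)
def has_unique_words(trigram):
    """Cached check: True when no word in the trigram repeats."""
    words = trigram.split()
    return len(words) == len(set(words))


df = pd.DataFrame({
    'val': [1, 2],
    'Trigrams': [
        ['status opportunity reference', 'reject remove reject',
         'situation situation hon', 'Good Sure Cable'],
        ['don bradman sir', 'manners maketh man', 'gold chain chain'],
    ],
})
# Smaller than the 200k-row benchmark above, just to keep the sketch quick.
df = pd.concat([df] * 1000, ignore_index=True)

# Each distinct trigram is split only once; repeats are served from the cache.
df['Trigrams3'] = [[t for t in row if has_unique_words(t)]
                   for row in df['Trigrams']]
```

Note that `maxsize=None` lets the cache grow without bound, which costs memory when most trigrams are distinct; in that case the cache is unlikely to help.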

CodePudding user response：

One approach is a nested list comprehension:

import pandas as pd

df = pd.DataFrame(
    [[['status opportunity reference', 'reject remove reject', 'situation situation hon', 'Good Sure Cable']],
     [['don bradman sir', 'manners maketh man', 'gold chain chain']]], columns=["trigrams"])
res = [[trigram for trigram in trigrams if len(set(trigram.split())) == 3] for trigrams in df["trigrams"]]
print(res)

Output

[['status opportunity reference', 'Good Sure Cable'], ['don bradman sir', 'manners maketh man']]