Match 2 grams on data then sum their values


I'm a Python beginner, and I was inspired by some Python courses to try this. Below is the example CSV file.

Name                   Location  Number
Andrew Platt Andrew    A B C     100
Steven Thunder Andrew  A B C     50
Jeff England Steven    A B C     30
Andrew England Jeff    A B C     30

I want to get a result like this:

{'Andrew': 180,
 'Platt': 100,
 'Steven': 80,
 'Jeff': 60,
 'England': 60,
 'Andrew Platt': 100,
 'Platt Andrew': 100,
 'Steven Thunder': 50,
 'Thunder Andrew': 50,
 'Jeff England': 30,
 'England Steven': 30,
 'Andrew England': 30,
 'England Jeff': 30}

For one_words, e.g. 'Andrew': it appears in rows 1, 2 and 4, so the result is 180 (100 + 50 + 30).
For two_words, e.g. 'Andrew Platt': it appears in row 1 only, so the result is 100.

This is my attempt:

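from collections import Counter
from itertools import combinations
from pprint import pprint

import pandas as pd

data = [
    "Andrew Platt Andrew;100",
    "Steven Thunder Andrew;50",
    "Jeff England Steven;30",
    "Andrew England Jeff;30",
]

df = pd.DataFrame([n.split(";") for n in data], columns=["Name", "Number"])
items = " ".join(df.Name).split()

one_words = Counter()
two_words = Counter()

# Sum Number over every row whose Name contains the word.
for item in set(items):
    one_words[item] = int(df.loc[df.Name.str.contains(item)].Number.astype("int").sum())

# combinations() pairs every two words of a row, not only adjacent ones.
for name, number in zip(df.Name, df.Number):
    for two_word in combinations(name.split(), 2):
        if len(set(two_word)) == 1:
            continue  # skip pairs such as ('Andrew', 'Andrew')
        two_words[" ".join(two_word)] += int(number)

pprint(one_words)
pprint(two_words)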


My result:

Counter({'Andrew': 180,
         'Platt': 100,
         'Steven': 80,
         'Jeff': 60,
         'England': 60,
         'Thunder': 50})
Counter({'Andrew Platt': 100,
         'Platt Andrew': 100,
         'Steven Thunder': 50,
         'Steven Andrew': 50,
         'Thunder Andrew': 50,
         'Jeff England': 30,
         'Jeff Steven': 30,
         'England Steven': 30,
         'Andrew England': 30,
         'Andrew Jeff': 30,
         'England Jeff': 30})

For two_words, given a row like [a, b, c], the output should be [[a, b], [b, c]]. That means entries like 'Steven Andrew': 50 should not appear.
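
To make the difference concrete, here is a minimal sketch comparing what `combinations` yields with the adjacent pairs described above. The list `words` is only an illustrative name, not part of the CSV, and `zip(words, words[1:])` is one possible way to get neighbours on Python 3.8.

```python
from itertools import combinations

words = ["a", "b", "c"]  # illustrative input, not taken from the CSV

# combinations() pairs every two positions, including non-adjacent ones.
all_pairs = list(combinations(words, 2))

# zip() with a one-step offset pairs only neighbouring words.
adjacent_pairs = list(zip(words, words[1:]))

print(all_pairs)       # [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(adjacent_pairs)  # [('a', 'b'), ('b', 'c')]
```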

Python version is 3.8.13

CodePudding user response:

Here's something that'll do it.

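from itertools import chain, pairwise  # pairwise requires Python >= 3.10

# Each row gets its unique single words plus its adjacent word pairs.
df['Key'] = df['Name'].apply(lambda x: list(set(chain.from_iterable((x.split(' '), map(' '.join, pairwise(x.split(' '))))))))
# One row per key, then sum per key (assumes df['Number'] is numeric).
df.explode('Key').groupby('Key')['Number'].sum().to_dict()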

{'Andrew': 180,
 'Andrew England': 30,
 'Andrew Platt': 100,
 'England': 60,
 'England Jeff': 30,
 'England Steven': 30,
 'Jeff': 60,
 'Jeff England': 30,
 'Platt': 100,
 'Platt Andrew': 100,
 'Steven': 80,
 'Steven Thunder': 50,
 'Thunder': 50,
 'Thunder Andrew': 50}
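
Note that `itertools.pairwise` only exists from Python 3.10, while the question mentions 3.8.13. One possible fallback, sketched here after the `pairwise` recipe from the older itertools documentation, is to define it yourself when the import fails. The sample string is just one row from the question.

```python
from itertools import tee

try:
    from itertools import pairwise  # available in Python >= 3.10
except ImportError:
    def pairwise(iterable):
        """Return successive overlapping pairs: s -> (s0, s1), (s1, s2), ..."""
        a, b = tee(iterable)
        next(b, None)  # advance the second iterator by one element
        return zip(a, b)

pairs = list(pairwise("Steven Thunder Andrew".split()))
print(pairs)  # [('Steven', 'Thunder'), ('Thunder', 'Andrew')]
```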

CodePudding user response:

Your code seems good. Just replace itertools.combinations with itertools.pairwise (Python ≥ 3.10), or use zip(items, items[1:]) on older versions.
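
As a rough illustration of the `zip` variant without pandas, a sketch along these lines should run on Python 3.8. The `rows` list is an assumed stand-in for the CSV data, and applying `zip` per row (rather than over all words at once) is a choice made here so that pairs never span two rows.

```python
from collections import Counter

# Rows from the question as (Name, Number) pairs; `rows` is a hypothetical name.
rows = [
    ("Andrew Platt Andrew", 100),
    ("Steven Thunder Andrew", 50),
    ("Jeff England Steven", 30),
    ("Andrew England Jeff", 30),
]

one_words = Counter()
two_words = Counter()

for name, number in rows:
    words = name.split()
    # set() so a word repeated within one row is counted once for that row.
    for word in set(words):
        one_words[word] += number
    # Adjacent pairs only; set() again guards against a repeated pair.
    for pair in set(zip(words, words[1:])):
        two_words[" ".join(pair)] += number

print(dict(one_words))
print(dict(two_words))
```

With the four sample rows this gives 180 for 'Andrew' and 100 for 'Andrew Platt', and no 'Steven Andrew' entry.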

For a solution:

# Unique single words + adjacent word pairs per row, then one row per key.
out = (df.assign(L=[list(dict.fromkeys((l:=x.split()))) + list(map(' '.join, zip(l, l[1:]))) for x in df['Name']])
       .explode('L').groupby('L', sort=False)['Number']
       .sum().sort_index(key=lambda x: x.str.count(' ')).to_dict()
      )

For a two-step solution, like your original code:

df2 = df.assign(L=[list(zip((l:=x.split()), l[1:])) for x in df['Name']]).explode('L')

pd.concat([df2.explode('L').drop_duplicates(['Name', 'L']).groupby('L', sort=False)['Number'].sum(),
           df2.assign(L=df2['L'].agg(' '.join)).groupby('L', sort=False)['Number'].sum()]).to_dict()


{'Andrew': 180,
 'Platt': 100,
 'Steven': 80,
 'Thunder': 50,
 'Jeff': 60,
 'England': 60,
 'Andrew Platt': 100,
 'Platt Andrew': 100,
 'Steven Thunder': 50,
 'Thunder Andrew': 50,
 'Jeff England': 30,
 'England Steven': 30,
 'Andrew England': 30,
 'England Jeff': 30}