Match 2 grams on data then sum their values-CodePudding

I'm a Python beginner. I had inspired by some Python courses. This is the example CSV file below.

Name	Location	Number
Andrew Platt Andrew	A B C	100
Steven Thunder Andrew	A B C	50
Jeff England Steven	A B C	30
Andrew England Jeff	A B C	30

I want to get a result like that:

['Andrew': 180,
 'Platt': 100,
 'Steven': 80,
 'Jeff': 60,
 'England': 60,
 'Thunder':50,
 'Andrew Platt': 100,
 'Platt Andrew': 100,
 'Steven Thunder': 50,
 'Thunder Andrew': 50,
 'Jeff England': 30,
 'England Steven': 30,
 'Andrew England': 30,
 'England Jeff': 30]

Logic:
one_words, e.g. 'Andrew', as it shows rows 1, 2 and 4, so the result is 180 (100 50 30)
two_words, e.g. 'Andrew Platt', as it shows row 1 only, so the result is 100

This is my tried below:

from collections import Counter
from itertools import combinations
import itertools
from pprint import pprint
import pandas as pd

data=[
    "Andrew Platt Andrew;100,",
    "Steven Thunder Andrew;50",
    "Jeff England Steven;30",
    "Andrew England Jeff;30"
    ]

one_words=Counter()
two_words=Counter()
df=[n.split(";") for n in data[0:]]
df=pd.DataFrame(df[1:],columns=df[0])
df.columns=('Name','Number')
df=df.replace('\,','',regex=True)
df['Number']=df.Number.replace('W','',regex=True)
items=" ".join(df.Name).split()

for item in set(items):
    one_words[item]  = df.loc[df.Name.str.contains(item)].Number.astype('int').sum()
for two_word in combinations(items, 2):
    if len(set(two_word)) == 1:
        continue
    two_words[" ".join(two_word)]  = df.loc[df.Name.str.contains(item)].Number.astype('int').sum()

pprint(one_words)
pprint(two_words)

My result:

Counter({'Andrew': 180,
         'Platt': 100,
         'Steven': 80,
         'Jeff': 60,
         'England': 60,
         'Thunder': 50})
Counter({'Andrew Platt': 100,
         'Platt Andrew': 100,
         'Steven Thunder': 50,
         'Steven Andrew': 50,
         'Thunder Andrew': 50,
         'Jeff England': 30,
         'Jeff Steven': 30,
         'England Steven': 30,
         'Andrew England': 30,
         'Andrew Jeff': 30,
         'England Jeff': 30})

Problem:
For two_words, like [a,b,c], output shuld be [[a,b],[b,c]], that means should not have like Steven Andrew': 50,

Note:
Python version is 3.8.13

CodePudding user response：

Here's something that'll do it.

from itertools import chain, pairwise
df['Key'] = df['Name'].apply(lambda x: list(set(chain.from_iterable((x.split(' '), map(' '.join, pairwise(x.split(' '))))))))
df.explode('Key').groupby('Key').Number.agg('sum').to_dict()

Outputs

{'Andrew': 180,
 'Andrew England': 30,
 'Andrew Platt': 100,
 'England': 60,
 'England Jeff': 30,
 'England Steven': 30,
 'Jeff': 60,
 'Jeff England': 30,
 'Platt': 100,
 'Platt Andrew': 100,
 'Steven': 80,
 'Steven Thunder': 50,
 'Thunder': 50,
 'Thunder Andrew': 50}

CodePudding user response：

Your code seems good, just replace itertools.combinations by itertools.pairwise (python ≥3.10, or use zip(items, items[1:])).

For a pandas solution:

out = (df.assign(L=[list(dict.fromkeys((l:=x.split()))) list(map(' '.join, zip(l, l[1:]))) for x in df['Name']])
       .explode('L').groupby('L', sort=False)['Number']
       .sum().sort_index(key=lambda x: x.str.count(' ')).to_dict()
       )

For a two steps solution like your original code:

df2 = df.assign(L=[list(zip((l:=x.split()), l[1:])) for x in df['Name']]).explode('L')

pd.concat([df2.explode('L').drop_duplicates(['Name', 'L']).groupby('L', sort=False)['Number'].sum(),
           df2.assign(L=df2['L'].agg(' '.join)).groupby('L', sort=False)['Number'].sum()]).to_dict()

Output:

{'Andrew': 180,
 'Platt': 100,
 'Steven': 80,
 'Thunder': 50,
 'Jeff': 60,
 'England': 60,
 'Andrew Platt': 100,
 'Platt Andrew': 100,
 'Steven Thunder': 50,
 'Thunder Andrew': 50,
 'Jeff England': 30,
 'England Steven': 30,
 'Andrew England': 30,
 'England Jeff': 30}