I'm a Python beginner and was inspired by some Python courses. Here is the example CSV file:
| Name | Location | Number |
| --- | --- | --- |
| Andrew Platt Andrew | A B C | 100 |
| Steven Thunder Andrew | A B C | 50 |
| Jeff England Steven | A B C | 30 |
| Andrew England Jeff | A B C | 30 |
I want to get a result like this:
{'Andrew': 180,
 'Platt': 100,
 'Steven': 80,
 'Jeff': 60,
 'England': 60,
 'Thunder': 50,
 'Andrew Platt': 100,
 'Platt Andrew': 100,
 'Steven Thunder': 50,
 'Thunder Andrew': 50,
 'Jeff England': 30,
 'England Steven': 30,
 'Andrew England': 30,
 'England Jeff': 30}
Logic:
one_words, e.g. 'Andrew': it appears in rows 1, 2 and 4, so the result is 180 (100 + 50 + 30).
two_words, e.g. 'Andrew Platt': it appears in row 1 only, so the result is 100.
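To make that logic concrete, here is a minimal plain-Python sketch of the desired counting (the inline list of (name, number) pairs mirrors the data above; only adjacent word pairs are counted, as the desired result shows):

from collections import Counter

rows = [("Andrew Platt Andrew", 100), ("Steven Thunder Andrew", 50),
        ("Jeff England Steven", 30), ("Andrew England Jeff", 30)]
one_words = Counter()
two_words = Counter()
for name, number in rows:
    words = name.split()
    for word in set(words):              # each word counted once per row
        one_words[word] += number
    for pair in zip(words, words[1:]):   # adjacent pairs only
        if pair[0] != pair[1]:
            two_words[" ".join(pair)] += number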
This is what I tried:
from collections import Counter
from itertools import combinations
from pprint import pprint
import pandas as pd

data = [
    "Andrew Platt Andrew;100,",
    "Steven Thunder Andrew;50",
    "Jeff England Steven;30",
    "Andrew England Jeff;30"
]

one_words = Counter()
two_words = Counter()

df = [n.split(";") for n in data]
df = pd.DataFrame(df, columns=['Name', 'Number'])
df = df.replace(',', '', regex=True)                      # drop the stray trailing comma in the first row
df['Number'] = df.Number.replace(r'\W', '', regex=True)   # strip any other non-word characters

items = " ".join(df.Name).split()
for item in set(items):
    one_words[item] = df.loc[df.Name.str.contains(item)].Number.astype('int').sum()

for item in df.Name:
    for two_word in combinations(item.split(), 2):
        if len(set(two_word)) == 1:
            continue
        two_words[" ".join(two_word)] = df.loc[df.Name.str.contains(item)].Number.astype('int').sum()

pprint(one_words)
pprint(two_words)
My result:
Counter({'Andrew': 180,
'Platt': 100,
'Steven': 80,
'Jeff': 60,
'England': 60,
'Thunder': 50})
Counter({'Andrew Platt': 100,
'Platt Andrew': 100,
'Steven Thunder': 50,
'Steven Andrew': 50,
'Thunder Andrew': 50,
'Jeff England': 30,
'Jeff Steven': 30,
'England Steven': 30,
'Andrew England': 30,
'Andrew Jeff': 30,
'England Jeff': 30})
Problem:
For two_words, given a name with words [a, b, c], the output should only contain the adjacent pairs [[a, b], [b, c]]; that means entries like 'Steven Andrew': 50 should not appear.
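A quick illustration of the difference with a three-word example:

from itertools import combinations
words = ['a', 'b', 'c']
print(list(combinations(words, 2)))  # [('a', 'b'), ('a', 'c'), ('b', 'c')] - includes the unwanted ('a', 'c')
print(list(zip(words, words[1:])))   # [('a', 'b'), ('b', 'c')] - adjacent pairs only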
Note:
Python version is 3.8.13
CodePudding user response:
Here's something that'll do it.
from itertools import chain, pairwise  # itertools.pairwise needs Python >= 3.10

# for each name: its single words plus its adjacent two-word phrases
df['Key'] = df['Name'].apply(lambda x: list(set(chain.from_iterable(
    (x.split(' '), map(' '.join, pairwise(x.split(' '))))
))))
df.explode('Key').groupby('Key').Number.agg('sum').to_dict()
Outputs
{'Andrew': 180,
'Andrew England': 30,
'Andrew Platt': 100,
'England': 60,
'England Jeff': 30,
'England Steven': 30,
'Jeff': 60,
'Jeff England': 30,
'Platt': 100,
'Platt Andrew': 100,
'Steven': 80,
'Steven Thunder': 50,
'Thunder': 50,
'Thunder Andrew': 50}
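Note that this assumes df['Number'] is already numeric; with the string column built in the question you would convert it first, e.g. df['Number'] = df['Number'].astype(int). Also, itertools.pairwise needs Python ≥ 3.10 (see the next answer for a 3.8-compatible alternative).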
CodePudding user response:
Your code seems good; just replace itertools.combinations with itertools.pairwise (Python ≥ 3.10, or use zip(items, items[1:])).
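For example, on Python 3.8 the two-word part could be rewritten with zip like this (a minimal sketch following the loop structure of the question's code):

from collections import Counter

two_words = Counter()
for name, number in zip(df.Name, df.Number.astype(int)):
    words = name.split()
    for two_word in zip(words, words[1:]):   # adjacent pairs only
        if len(set(two_word)) == 1:
            continue
        two_words[" ".join(two_word)] += number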
For a pandas solution:
# one list per name: unique single words (in order) followed by adjacent two-word phrases
out = (df.assign(L=[list(dict.fromkeys((l:=x.split()))) + list(map(' '.join, zip(l, l[1:])))
                    for x in df['Name']])
         .explode('L').groupby('L', sort=False)['Number']
         .sum().sort_index(key=lambda x: x.str.count(' ')).to_dict()
       )
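The sort_index(key=lambda x: x.str.count(' ')) step only orders the result so that one-word keys come before two-word keys; it can be dropped if the order does not matter.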
For a two-step solution like your original code:
# df2: one row per adjacent word pair of each name
df2 = df.assign(L=[list(zip((l:=x.split()), l[1:])) for x in df['Name']]).explode('L')
# first groupby: single-word sums (each word counted once per name); second: two-word sums
pd.concat([df2.explode('L').drop_duplicates(['Name', 'L']).groupby('L', sort=False)['Number'].sum(),
           df2.assign(L=df2['L'].agg(' '.join)).groupby('L', sort=False)['Number'].sum()]).to_dict()
Output:
{'Andrew': 180,
'Platt': 100,
'Steven': 80,
'Thunder': 50,
'Jeff': 60,
'England': 60,
'Andrew Platt': 100,
'Platt Andrew': 100,
'Steven Thunder': 50,
'Thunder Andrew': 50,
'Jeff England': 30,
'England Steven': 30,
'Andrew England': 30,
'England Jeff': 30}