I have two data frames
df1
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 100
1 11 2023 Lyon Chicago,Paris 200
2 11 2023 Berlin Paris 300
3 12 2022 Newyork Chicago 150
4 12 2022 Lyon Chicago,Paris 250
5 12 2022 Berlin Paris 400
df2
ID Year Primary_Location Sales
0 11 2023 Chicago 150
1 11 2023 Paris 200
2 12 2022 Chicago 300
3 12 2022 Paris 350
I would like for each group having the same ID
& Year
:
to add the column Sales
from df2
to Sales
in df1
where Primary_Location
in df2
appear (contained) in Secondary_Location
in df1
.
For example: For ID=11
& Year=2023
, Sales
for Lyon
would be added to Sales
for Chicago
& Sales
for Paris
of df_2
.
New Sales
of Paris
for that row would be 200 150 200=550.
The expected output would be :
df_primary_output
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 400
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
CodePudding user response:
Not so easy your question...
Proposed code
import pandas as pd
# Your dummy dataset
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
def get_dict(g):
return {g['Primary_Location'].iloc[i]:g['Sales'].iloc[i] for i in range(len(g))}
# return these dictionnaries
# {'Chicago': 100, 'Paris': 200}
# {'Chicago': 300, 'Paris': 350}
tot = []
def func(g):
global df1, tot
i, somme = [0]*2
iterdf = df1.iterrows() # Dataframe iterator
kv = get_dict(g) # Get dicts
while i < len(df1):
row = next(iterdf)[1]
# Select specific df1 rows to modify by ID and Year criteria
if g['ID'].iloc[1]==row['ID'] and g['Year'].iloc[1]==row['Year']:
somme = row['Sales']
for town in row['Secondary_Location'].split(','):
if town in kv:
somme =kv[town]
tot.append(somme)
i =1
df2.groupby(['ID', 'Year']).apply(lambda g: func(g))
df1['Sales'] = tot
print(df1)
Result :
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Are you sure of the result in line 3, my script found 450 and not 400 ?
CodePudding user response:
This should work:
s = 'Secondary_Location'
(df.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)
.agg({**dict.fromkeys(df,'first'),**{s:','.join,'Sales_2':'sum'}})
.assign(Sales = lambda x: x['Sales'] x['Sales_2'])
.drop('Sales_2',axis=1))
Output:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750