I have two data frames: One contains distances between countries:
import pandas as pd
data = {
'countryA': ['France', 'France', 'Germany', 'Germany', 'Spain', 'Spain', 'US', 'US', 'Canada', 'Canada', 'Mexico','Mexico'],
'countryB': ['Germany', 'Spain', 'France', 'Spain', 'France', 'Germany', 'Canada', 'Mexico', 'US', 'Mexico', 'US', 'Canada'],
'dist': [200, 100, 100, 300, 100, 300, 150, 230, 800, 150, 230, 800]
}
df1 = pd.DataFrame(data)
df1
countryA countryB dist
0 France Germany 200
1 France Spain 100
2 Germany France 100
3 Germany Spain 300
4 Spain France 100
5 Spain Germany 300
6 US Canada 150
7 US Mexico 230
8 Canada US 800
9 Canada Mexico 150
10 Mexico US 230
11 Mexico Canada 800
Another that has data on their membership in international organizations. If a country is a member of the organization, in a particular year, it gets "1". Otherwise, it gets "0".
data = {'year': [2000, 2001, 2002, 2000, 2001, 2002],
'ioname': ['EU', 'EU', 'EU', 'NAFTA', 'NAFTA', 'NAFTA'],
'France': [1,1,1,0,0,0],
'Germany': [1,1,1,0,0,0],
'Spain': [1,1,1,0,0,0],
'US': [0,0,0,1,1,1],
'Canada': [0,0,0,1,1,1],
'Mexico': [0,0,0,1,1,1]
}
df2 = pd.DataFrame(data)
df2
year ioname France Germany Spain US Canada Mexico
0 2000 EU 1 1 1 0 0 0
1 2001 EU 1 1 1 0 0 0
2 2002 EU 1 1 1 0 0 0
3 2000 NAFTA 0 0 0 1 1 1
4 2001 NAFTA 0 0 0 1 1 1
5 2002 NAFTA 0 0 0 1 1 1
I wish to compute the mean distance between all the members of a given organization, like so:
data = {'year': [2000, 2001, 2002, 2000, 2001, 2002],
'ioname': ['EU', 'EU', 'EU', 'NAFTA', 'NAFTA', 'NAFTA'],
'mean_distance': [200, 200, 200, 360, 360, 360]
}
df3 = pd.DataFrame(data)
df3
year ioname mean_distance
0 2000 EU 200
1 2001 EU 200
2 2002 EU 200
3 2000 NAFTA 360
4 2001 NAFTA 360
5 2002 NAFTA 360
How can I achieve this?
CodePudding user response:
Use melt
to reshape your dataframe df2
in order to merge with df1
then compute mean distance for each (year, ioname)
.
out = df2.melt(id_vars=['year', 'ioname'],
var_name='countryA',
value_name='is_member') \
.query('is_member == 1')[['year', 'ioname', 'countryA']] \
.merge(df1[['countryA', 'dist']], on='countryA') \
.groupby(['year', 'ioname'])['dist'].mean() \
.sort_index(level=1).reset_index()
Output result:
>>> out
year ioname dist
0 2000 EU 183.333333
1 2001 EU 183.333333
2 2002 EU 183.333333
3 2000 NAFTA 393.333333
4 2001 NAFTA 393.333333
5 2002 NAFTA 393.333333