Home > Enterprise >  Compute descriptive statistics from one dataframe based on multiple conditions from another datafram
Compute descriptive statistics from one dataframe based on multiple conditions from another datafram

Time:10-19

I have two data frames: One contains distances between countries:

import pandas as pd

data = {
        'countryA': ['France', 'France', 'Germany', 'Germany', 'Spain', 'Spain', 'US', 'US', 'Canada', 'Canada', 'Mexico','Mexico'],
        'countryB': ['Germany', 'Spain', 'France', 'Spain', 'France', 'Germany', 'Canada', 'Mexico', 'US', 'Mexico', 'US', 'Canada'],
        'dist': [200, 100, 100, 300, 100, 300, 150, 230, 800, 150, 230, 800]
        }

df1 = pd.DataFrame(data)
df1
countryA    countryB    dist
0   France  Germany     200
1   France  Spain       100
2   Germany France      100
3   Germany Spain       300
4   Spain   France      100
5   Spain   Germany     300
6     US    Canada      150
7     US    Mexico      230
8     Canada    US      800
9   Canada    Mexico    150
10  Mexico      US      230
11  Mexico    Canada    800

Another that has data on their membership in international organizations. If a country is a member of the organization, in a particular year, it gets "1". Otherwise, it gets "0".

data = {'year': [2000, 2001, 2002, 2000, 2001, 2002],
        'ioname': ['EU', 'EU', 'EU', 'NAFTA', 'NAFTA', 'NAFTA'],
        'France': [1,1,1,0,0,0],
        'Germany': [1,1,1,0,0,0],
        'Spain': [1,1,1,0,0,0],
        'US': [0,0,0,1,1,1],
        'Canada': [0,0,0,1,1,1],
        'Mexico': [0,0,0,1,1,1]
        }

df2 = pd.DataFrame(data)
df2

    year    ioname  France  Germany Spain   US  Canada  Mexico
0   2000      EU      1        1      1     0      0    0
1   2001      EU      1        1      1     0      0    0
2   2002      EU      1        1      1     0      0    0
3   2000      NAFTA   0        0      0     1      1    1
4   2001      NAFTA   0        0      0     1      1    1
5   2002      NAFTA   0        0      0     1      1    1

I wish to compute the mean distance between all the members of a given organization, like so:

data = {'year': [2000, 2001, 2002, 2000, 2001, 2002],
        'ioname': ['EU', 'EU', 'EU', 'NAFTA', 'NAFTA', 'NAFTA'],
        'mean_distance': [200, 200, 200, 360, 360, 360]
        }

df3 = pd.DataFrame(data)
df3

    year    ioname  mean_distance
0   2000      EU     200
1   2001      EU     200
2   2002      EU     200
3   2000      NAFTA  360
4   2001      NAFTA  360
5   2002      NAFTA  360

How can I achieve this?

CodePudding user response:

Use melt to reshape your dataframe df2 in order to merge with df1 then compute mean distance for each (year, ioname).

out = df2.melt(id_vars=['year', 'ioname'],
               var_name='countryA',
               value_name='is_member') \
         .query('is_member == 1')[['year', 'ioname', 'countryA']] \
         .merge(df1[['countryA', 'dist']], on='countryA') \
         .groupby(['year', 'ioname'])['dist'].mean() \
         .sort_index(level=1).reset_index()

Output result:

>>> out
   year ioname        dist
0  2000     EU  183.333333
1  2001     EU  183.333333
2  2002     EU  183.333333
3  2000  NAFTA  393.333333
4  2001  NAFTA  393.333333
5  2002  NAFTA  393.333333
  • Related