Home > Enterprise >  Compute descriptive statistics from one dataframe based on conditions from another dataframe (runnin
Compute descriptive statistics from one dataframe based on conditions from another dataframe (runnin

Time:10-19

I have two data frames: one that that has data on GDP of countries:

    import pandas as pd

data = {'year': [2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002],
        'country': ['France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'US', 'US', 'US', "Canada", "Canada","Canada"],
        'GDP': [100, 150, 165, 300, 315, 318, 700, 789, 854, 320,313, 324]
        }

df1 = pd.DataFrame(data)
df1

    year    country GDP
0   2000    France  100
1   2001    France  150
2   2002    France  165
3   2000    Germany 300
4   2001    Germany 315
5   2002    Germany 318
6   2000    US      700
7   2001    US      789
8   2002    US      854
9   2000    Canada  320
10  2001    Canada  313
11  2002    Canada  324

Another that has data on their membership in international organizations. If a country is a member of the organization, in a particular year, it gets "1". Otherwise, it gets "0".

data = {'year': [2000, 2001, 2002, 2000, 2001, 2002],
        'ioname': ['EU', 'EU', 'EU', 'NAFTA', 'NAFTA', 'NAFTA'],
        'France': [1,1,1,0,0,0],
        'Germany': [1,1,1,0,0,0],
        'US': [0,0,0,1,1,1],
        'Canada': [0,0,0,1,1,1],
        }

df2 = pd.DataFrame(data)
df2

year    ioname  France  Germany US  Canada
0   2000    EU      1      1    0     0
1   2001    EU      1      1    0     0
2   2002    EU      1      1    0     0
3   2000    NAFTA   0      0    1     1
4   2001    NAFTA   0      0    1     1
5   2002    NAFTA   0      0    1     1

I wish to compute the mean GDP for each organization's members. For example, for the EU we should use the values of Germany and France only. This should be the final outcome:

data = {'year': [2000, 2001, 2002, 2000, 2001, 2002],
        'ioname': ['EU', 'EU', 'EU', 'NAFTA', 'NAFTA', 'NAFTA'],
        'France': [1,1,1,0,0,0],
        'Germany': [1,1,1,0,0,0],
        'US': [0,0,0,1,1,1],
        'Canada': [0,0,0,1,1,1],
        'mean_gdp': [200, 232.5, 241.5, 510, 551, 589]
        }

df3 = pd.DataFrame(data)
df3

    year    ioname  France  Germany US  Canada  mean_gdp
0   2000      EU       1       1    0     0      200.0
1   2001      EU       1       1    0     0      232.5
2   2002      EU       1       1    0     0      241.5
3   2000    NAFTA      0       0    1     1      510.0
4   2001    NAFTA      0       0    1     1      551.0
5   2002    NAFTA      0       0    1     1      589.0

How can I do this?

CodePudding user response:

Use DataFrame.pivot for possible multiple values in df2 with convert both columns to MultiIndex, replace 0 to NaNs, so possible create mean, last add new column by DataFrame.join:

df = df1.pivot('year','country','GDP')
s = df2.set_index(['year','ioname']).mul(df, level=0).replace(0, np.nan).mean(axis=1)


df = df2.join(s.rename('mean_gdp'), on=['year','ioname'])
print (df)
   year ioname  France  Germany  US  Canada  mean_gdp
0  2000     EU       1        1   0       0     200.0
1  2001     EU       1        1   0       0     232.5
2  2002     EU       1        1   0       0     241.5
3  2000  NAFTA       0        0   1       1     510.0
4  2001  NAFTA       0        0   1       1     551.0
5  2002  NAFTA       0        0   1       1     589.0
  • Related