Home > OS >  Doing group calculations with two separate dataframes in python
Doing group calculations with two separate dataframes in python

Time:01-21

I have two pandas dataframes like this:

df1= pd.DataFrame({'sub-group':['2020','2030','2040','2030','2040','2030','2040'], 
                      'group':['a', 'a', 'a', 'b', 'b', 'c', 'c'],
                      'value1':[12,11,41,33,66,22,20]})

      sub-group  group  value1
      2020       a      12
      2030       a      11
      2040       a      41
      2030       b      33
      2040       b      66
      2030       c      22
      2040       c      20

df2= pd.DataFrame({'sub-group':['2020','2030','2040', '2020', '2030','2040','2030','2040'], 
                      'group':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
                      'value2':[10,20,30,15,45,60,12,36]})

      sub-group group  value2
      2020      a      10
      2030      a      20
      2040      a      30
      2020      b      15
      2030      b      45
      2040      b      60
      2030      c      12
      2040      c      36

I want to find valu1/value2 for each group and sub-group. Note that the number of observations might not match in two dataframes. for example, we have 2020/b in df2 but not in df1. I those cases a nan or 0 would work.

I was thinking that it should be possible with pd.groupby but I don't know how it works with two dataframes. Thanks.

CodePudding user response:

For align groups and sub-group create MultiIndex Series by DataFrame.set_index, then divide by Series.div, last for 3 columns DataFrame add Series.reset_index:

df = (df1.set_index(['group','sub-group'])['value1']
         .div(df2.set_index(['group','sub-group'])['value2'])
         .reset_index(name='out'))
print (df)
  group sub-group       out
0     a      2020  1.200000
1     a      2030  0.550000
2     a      2040  1.366667
3     b      2020       NaN
4     b      2030  0.733333
5     b      2040  1.100000
6     c      2030  1.833333
7     c      2040  0.555556

CodePudding user response:

You can use a merge to align the Series:

df2.merge(df1, on=['group', 'sub-group'], how='outer').eval('new = value1/value2')

NB. I'm using eval here for a concise way to perform the computation, but you can also retrieve "value1"/"value2" and do whatever you want. Another way is df2.merge(df1, on=['group', 'sub-group'], how='outer').assign(new=lambda d: d['value1'].div(d['value2'])).

Output:

  sub-group group  value2  value1       new
0      2020     a      10    12.0  1.200000
1      2030     a      20    11.0  0.550000
2      2040     a      30    41.0  1.366667
3      2020     b      15     NaN       NaN
4      2030     b      45    33.0  0.733333
5      2040     b      60    66.0  1.100000
6      2030     c      12    22.0  1.833333
7      2040     c      36    20.0  0.555556

If you want a different order of the rows, just swap df1/df2:

df1.merge(df2, on=['group', 'sub-group'], how='outer').eval('new = value1/value2')

Output:

  sub-group group  value1  value2       new
0      2020     a    12.0      10  1.200000
1      2030     a    11.0      20  0.550000
2      2040     a    41.0      30  1.366667
3      2030     b    33.0      45  0.733333
4      2040     b    66.0      60  1.100000
5      2030     c    22.0      12  1.833333
6      2040     c    20.0      36  0.555556
7      2020     b     NaN      15       NaN

CodePudding user response:

You can just groupby and divide the two dataframes (I put the sum as aggregate method in case you have more rows with the same combination group and sub-group)

final_df = df1.groupby(['group', 'sub-group'])['value1'].sum()/df2.groupby(['group', 'sub-group'])['value2'].sum()

Output:

group  sub-group
a      2020         1.200000
       2030         0.550000
       2040         1.366667
b      2020              NaN
       2030         0.733333
       2040         1.100000
c      2030         1.833333
       2040         0.555556
  • Related