How to aggregate dataframe into a row

Time:12-06

Given two dataframes df_1 and df_2, how can I aggregate the values of df_2 into the rows of df_1 so that each date in df_1 falls between open and close in df_2?

print(df_1)

  date          A          B
0 2021-11-01    0.020228   0.026572
1 2021-11-02    0.057780   0.175499
2 2021-11-03    0.098808   0.620986
3 2021-11-04    0.158789   1.014819
4 2021-11-05    0.038129   2.384590


print(df_2)

  open        close       location     division     size    
0 2021-11-07  2021-11-14  LDN          Alpha        120
1 2021-11-01  2021-11-14  PRS          Alpha        450
2 2021-10-14  2021-11-27  HK           Beta         340

I have tried this solution for joining my dataframes; now I need to find a way to aggregate. What I have done so far is:

df_2.index = pd.IntervalIndex.from_arrays(df_2['open'], df_2['close'], closed='both')
df_1['events'] = df_1['date'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)])


print(calls['code'].iloc[0].groupby(['location', 'division'])['size'].sum())

location  division              
LDN       Alpha                     421.0
LDN       Beta                      515.0
NY        Alpha                     369.0
PRQ       Alpha                     132.0
          Gamma                     110.0

I need something that looks like this:

  date          A          B          LDN_Alpha   LDN_Beta   LDN_Gamma   PRS_Alpha   ...
0 2021-11-01    0.020228   0.026572   120         300        0           530
1 2021-11-02    0.057780   0.175499   ...
2 2021-11-03    0.098808   0.620986
3 2021-11-04    0.158789   1.014819
4 2021-11-05    0.038129   2.384590

The new columns should hold the sum of size, grouped by location and division, over the df_2 rows whose open-to-close range contains that date.

CodePudding user response:

The idea is to first expand each row of df_2 into one row per date between its open and close values, keeping the original df_2 columns, then reshape with DataFrame.pivot_table and attach the result to df_1 with DataFrame.join:

df_1['date'] = pd.to_datetime(df_1['date'])

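# Map every date in [open, close] to the label of the df_2 row it came from
s = pd.concat([pd.Series(r.Index, pd.date_range(r.open, r.close)) for r in df_2.itertuples()])
# Swap index and values so the dates can be joined onto df_2 by row label
df = df_2.join(pd.Series(s.index, s).rename('date'))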

df = df.pivot_table(index='date', 
                    columns=['location', 'division'], 
                    values='size', 
                    aggfunc='sum', 
                    fill_value=0)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')

df = df_1.join(df, on='date')
print(df)
        date         A         B  HK_Beta  LDN_Alpha  PRS_Alpha
0 2021-11-01  0.020228  0.026572      340          0        450
1 2021-11-02  0.057780  0.175499      340          0        450
2 2021-11-03  0.098808  0.620986      340          0        450
3 2021-11-04  0.158789  1.014819      340          0        450
4 2021-11-05  0.038129  2.384590      340          0        450
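
A different way to get the same table, shown here only as a minimal sketch and not as part of the answer above, is to cross-join the dates with df_2 and keep the pairs where the date lies in [open, close]. The function name aggregate_open_events is made up for illustration. The sketch assumes pandas 1.2 or later (for how="cross") and that date, open and close are already datetimes.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "date": pd.to_datetime(["2021-11-01", "2021-11-02", "2021-11-03",
                            "2021-11-04", "2021-11-05"]),
    "A": [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    "B": [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})
df_2 = pd.DataFrame({
    "open": pd.to_datetime(["2021-11-07", "2021-11-01", "2021-10-14"]),
    "close": pd.to_datetime(["2021-11-14", "2021-11-14", "2021-11-27"]),
    "location": ["LDN", "PRS", "HK"],
    "division": ["Alpha", "Alpha", "Beta"],
    "size": [120, 450, 340],
})


def aggregate_open_events(dates, events):
    """Hypothetical helper: sum `size` per location/division for each date."""
    # Pair every distinct date with every event row (needs pandas >= 1.2).
    pairs = dates[["date"]].drop_duplicates().merge(events, how="cross")
    # Keep pairs where the date is inside [open, close], both ends inclusive.
    pairs = pairs[pairs["date"].between(pairs["open"], pairs["close"])]
    # Build the "location_division" label used as the new column name.
    pairs = pairs.assign(key=pairs["location"] + "_" + pairs["division"])
    wide = pairs.pivot_table(index="date", columns="key", values="size",
                             aggfunc="sum", fill_value=0)
    # Make sure every location/division pair gets a column, even if it
    # never matches any of the dates.
    all_keys = sorted((events["location"] + "_" + events["division"]).unique())
    wide = wide.reindex(columns=all_keys, fill_value=0)
    # Dates with no matching event come back as NaN from the join; use 0.
    out = dates.join(wide, on="date")
    out[all_keys] = out[all_keys].fillna(0).astype(int)
    return out


result = aggregate_open_events(df_1, df_2)
print(result)
```

The cross join builds len(df_1) * len(df_2) rows before filtering, so this is only reasonable for small frames; the date_range expansion in the answer above scales with the length of the open-to-close ranges instead.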