Home > OS >  How to aggregate of a datetime dataframe based on days and then how to calculate average?
How to aggregate of a datetime dataframe based on days and then how to calculate average?

Time:08-31

I have a dataframe with two columns, date and values.

import numpy as np
import pandas as pd
import datetime
from pandas import Timestamp


a = [[Timestamp('2014-06-17 00:00:00'), 0.023088847378082145],
 [Timestamp('2014-06-18 00:00:00'), -0.02137513226556209],
 [Timestamp('2014-06-19 00:00:00'), -0.023107608748262454],
 [Timestamp('2014-06-20 00:00:00'), -0.005373831609931101],
 [Timestamp('2014-06-23 00:00:00'), 0.0013989552359290336],
 [Timestamp('2014-06-24 00:00:00'), 0.02109937927428618],
 [Timestamp('2014-06-25 00:00:00'), -0.008350303722982733],
 [Timestamp('2014-06-26 00:00:00'), -0.037202662556428456],
 [Timestamp('2014-06-27 00:00:00'), 0.00019764611153205713],
 [Timestamp('2014-06-30 00:00:00'), 0.003260577288983324],
 [Timestamp('2014-07-01 00:00:00'), -0.0072877596184343085],
 [Timestamp('2014-07-02 00:00:00'), 0.010168645518006336],
 [Timestamp('2014-07-03 00:00:00'), -0.011539447143668391],
 [Timestamp('2014-07-04 00:00:00'), 0.025285678867997374],
 [Timestamp('2014-07-07 00:00:00'), -0.004602922207492033],
 [Timestamp('2014-07-08 00:00:00'), -0.031298707413768834],
 [Timestamp('2014-07-09 00:00:00'), 0.005929355847110296],
 [Timestamp('2014-07-10 00:00:00'), -0.0037464360290646592],
 [Timestamp('2014-07-11 00:00:00'), -0.030786217361942203],
 [Timestamp('2014-07-14 00:00:00'), -0.004914625647469917],
 [Timestamp('2014-07-15 00:00:00'), 0.010865602291856957],
 [Timestamp('2014-07-16 00:00:00'), 0.018000430446729165],
 [Timestamp('2014-07-17 00:00:00'), -0.007274924758687407],
 [Timestamp('2014-07-18 00:00:00'), -0.005852455583728933],
 [Timestamp('2014-07-21 00:00:00'), 0.021397540863909104],
 [Timestamp('2014-07-22 00:00:00'), 0.03337842963821558],
 [Timestamp('2014-07-23 00:00:00'), 0.0022309307682939483],
 [Timestamp('2014-07-24 00:00:00'), 0.007548983718178803],
 [Timestamp('2014-07-25 00:00:00'), -0.018442920569716525],
 [Timestamp('2014-07-28 00:00:00'), -0.015902529445214975]]

df = pd.DataFrame(a, columns=['dates', 'Values'])

I want to calculate the average of the column Values aggregating each 5 days. The expected outcome in dataframe should be something like

      Average value
0   avg of first 5days
1   avg of next 5days
2   avg of next 5days
3   avg of next 5days
4   avg of next 5days
5   avg of next 5days

If possible then please help me to get a dataframe something like the below,

    Group Days    Average value
0      0       avg of first 5days
1      1       avg of next 5days
2      2       avg of next 5days
3      3       avg of next 5days
4      4       avg of next 5days
5      5       avg of next 5days

Please help me with this.

CodePudding user response:

Use DataFrame.resample with aggregate mean by 5D for 5 days:

df = df.resample('5D', on='dates')['Values'].mean().reset_index()

print (df)
       dates    Values
0 2014-06-17 -0.006692
1 2014-06-22 -0.005764
2 2014-06-27 -0.001277
3 2014-07-02  0.007972
4 2014-07-07 -0.012901
5 2014-07-12  0.007984
6 2014-07-17  0.002757
7 2014-07-22  0.006179
8 2014-07-27 -0.015903

EDIT: If need omit Sundays, Saturdays and holidays use 5B for bussiness days:

df = df.resample('5B', on='dates')['Values'].mean().reset_index()
print (df)
       dates    Values
0 2014-06-17 -0.005074
1 2014-06-24 -0.004199
2 2014-07-01  0.002405
3 2014-07-08 -0.012963
4 2014-07-15  0.007427
5 2014-07-22  0.001763

If need Group column use arange:

df = df.resample('5D', on='dates')['Values'].mean().reset_index()
df['Group'] = np.arange(len(df))
print (df)
       dates    Values  Group
0 2014-06-17 -0.006692      0
1 2014-06-22 -0.005764      1
2 2014-06-27 -0.001277      2
3 2014-07-02  0.007972      3
4 2014-07-07 -0.012901      4
5 2014-07-12  0.007984      5
6 2014-07-17  0.002757      6
7 2014-07-22  0.006179      7
8 2014-07-27 -0.015903      8
  • Related