Home > Net >  Pandas dataframe with hourly data: Calculating sums for specific times
Pandas dataframe with hourly data: Calculating sums for specific times

Time:11-10

There is a dataframe with hourly data, e.g.:

      DATE      TIME     Amount 
2022-11-07  21:00:00         10      
2022-11-07  22:00:00         11      
2022-11-08  07:00:00         10       
2022-11-08  08:00:00         13       
2022-11-08  09:00:00         12     
2022-11-08  10:00:00         11       
2022-11-08  11:00:00         13      
2022-11-08  12:00:00         12        
2022-11-08  13:00:00         10        
2022-11-08  14:00:00         9       
... 

I would like to add a new column sum_morning where I calculate the sum of "Amount" for the morning hours only (07:00 - 12:00):

      DATE      TIME     Amount      sum_morning
2022-11-07  21:00:00         10      NaN
2022-11-07  22:00:00         11      NaN
2022-11-08  07:00:00         10      NaN
2022-11-08  08:00:00         13      NaN
2022-11-08  09:00:00         12      NaN
2022-11-08  10:00:00         11      NaN
2022-11-08  11:00:00         13      NaN
2022-11-08  12:00:00         12      71  
2022-11-08  13:00:00         10      NaN  
2022-11-08  14:00:00         9       NaN
... 

There can be gaps in the dataframe (e.g. from 22:00 - 07:00), so shift is probably not working here.

I thought about

  • creating a new dataframe where I filter all time slices from 07:00 - 12:00 for all dates
  • do a group by and calculate the sum for each day
  • and then merge this back to the original df.

But maybe there is a more effective solution?

I really enjoy working with Python / pandas, but hourly data still makes my head spin.

CodePudding user response:

First set a DatetimeIndex in order to use DataFrame.between_time, then groupby DATE and aggregate by sum. Finally, get the last value of datetimes per day, in order to match the index of the original DataFrame:

df.index = pd.to_datetime(df['DATE']   ' '   df['TIME'])

s = (df.between_time('7:00','12:00')
       .reset_index()
       .groupby('DATE')
       .agg({'Amount':'sum', 'index':'last'})
       .set_index('index')['Amount'])

df['sum_morning'] = s
print (df)
                           DATE      TIME  Amount  sum_morning
2022-11-07 21:00:00  2022-11-07  21:00:00      10          NaN
2022-11-07 22:00:00  2022-11-07  22:00:00      11          NaN
2022-11-08 07:00:00  2022-11-08  07:00:00      10          NaN
2022-11-08 08:00:00  2022-11-08  08:00:00      13          NaN
2022-11-08 09:00:00  2022-11-08  09:00:00      12          NaN
2022-11-08 10:00:00  2022-11-08  10:00:00      11          NaN
2022-11-08 11:00:00  2022-11-08  11:00:00      13          NaN
2022-11-08 12:00:00  2022-11-08  12:00:00      12         71.0
2022-11-08 13:00:00  2022-11-08  13:00:00      10          NaN
2022-11-08 14:00:00  2022-11-08  14:00:00       9          NaN

Lastly, if you need to remove DatetimeIndex you can use:

df = df.reset_index(drop=True)

CodePudding user response:

You can use:

# get values between 7 and 12h
m = pd.to_timedelta(df['TIME']).between('7h', '12h')

# find last True per day
idx = m&m.groupby(df['DATE']).shift(-1).ne(True)

# assign the sum of the 7-12h values on the last True per day
df.loc[idx, 'sum_morning'] = df['Amount'].where(m).groupby(df['DATE']).transform('sum')

Output:

         DATE      TIME  Amount  sum_morning
0  2022-11-07  21:00:00      10          NaN
1  2022-11-07  22:00:00      11          NaN
2  2022-11-08  07:00:00      10          NaN
3  2022-11-08  08:00:00      13          NaN
4  2022-11-08  09:00:00      12          NaN
5  2022-11-08  10:00:00      11          NaN
6  2022-11-08  11:00:00      13          NaN
7  2022-11-08  12:00:00      12         71.0
8  2022-11-08  13:00:00      10          NaN
9  2022-11-08  14:00:00       9          NaN
  • Related