How do I aggregate data by hour?-CodePudding

So i'm currently working with a dataset so I can train a model to predict occupancy. I want to break down the dataset by hour, so aggregate the data every 60 minutes to get the mean value per hour or half an hour so its easier to read when plotting. How do I do that?

                  Date  Temperature  Humidity     Light       CO2  
1  2015-02-04 17:51:00     1.000000  0.470533  0.275490  0.190933   
2  2015-02-04 17:51:59     0.992823  0.470332  0.277754  0.186446   
3  2015-02-04 17:53:00     0.992823  0.469326  0.275490  0.186136   
4  2015-02-04 17:54:00     0.992823  0.467315  0.275490  0.182887   
5  2015-02-04 17:55:00     0.980861  0.467315  0.275490  0.180566   
6  2015-02-04 17:55:59     0.980861  0.467315  0.270964  0.178400   
7  2015-02-04 17:57:00     0.980861  0.467315  0.270964  0.178813   
8  2015-02-04 17:57:59     0.980861  0.467315  0.270964  0.177162   
9  2015-02-04 17:58:59     0.980861  0.467315  0.270964  0.171180   
10 2015-02-04 18:00:00     0.974880  0.466197  0.270964  0.170354   
11 2015-02-04 18:01:00     0.974880  0.465080  0.270964  0.171747   
12 2015-02-04 18:02:00     0.980861  0.462845  0.270964  0.172211   
13 2015-02-04 18:03:00     0.980861  0.465825  0.270964  0.167569   
14 2015-02-04 18:04:00     0.968900  0.465080  0.270964  0.170045   
15 2015-02-04 18:04:59     0.956938  0.463962  0.270964  0.169117   
16 2015-02-04 18:06:00     0.956938  0.463962  0.270640  0.165713   
17 2015-02-04 18:07:00     0.956938  0.467315  0.000000  0.166331   
18 2015-02-04 18:08:00     0.943780  0.471338  0.000000  0.168498   
19 2015-02-04 18:08:59     0.943780  0.475807  0.000000  0.168498   
20 2015-02-04 18:10:00     0.930622  0.475807  0.000000  0.170973

I'm sure this has probably been answered before but I couldn't find anything... i'm looking to plot this data.

CodePudding user response：

Essentially you'll convert your Date column to datetime-objects using pd.to_datetime and then simply df.groupby after setting the Date column as index using

df.groupby([df.index.hour]).mean()

Here's a full example:

import pandas as pd
from io import StringIO

data = """Date  Temperature  Humidity     Light       CO2  
2015-02-04 17:51:00     1.000000  0.470533  0.275490  0.190933   
2015-02-04 17:51:59     0.992823  0.470332  0.277754  0.186446   
2015-02-04 17:53:00     0.992823  0.469326  0.275490  0.186136   
2015-02-04 17:54:00     0.992823  0.467315  0.275490  0.182887   
2015-02-04 17:55:00     0.980861  0.467315  0.275490  0.180566   
2015-02-04 17:55:59     0.980861  0.467315  0.270964  0.178400   
2015-02-04 17:57:00     0.980861  0.467315  0.270964  0.178813   
2015-02-04 17:57:59     0.980861  0.467315  0.270964  0.177162   
2015-02-04 17:58:59     0.980861  0.467315  0.270964  0.171180   
2015-02-04 18:00:00     0.974880  0.466197  0.270964  0.170354   
2015-02-04 18:01:00     0.974880  0.465080  0.270964  0.171747   
2015-02-04 18:02:00     0.980861  0.462845  0.270964  0.172211   
2015-02-04 18:03:00     0.980861  0.465825  0.270964  0.167569   
2015-02-04 18:04:00     0.968900  0.465080  0.270964  0.170045   
2015-02-04 18:04:59     0.956938  0.463962  0.270964  0.169117   
2015-02-04 18:06:00     0.956938  0.463962  0.270640  0.165713   
2015-02-04 18:07:00     0.956938  0.467315  0.000000  0.166331   
2015-02-04 18:08:00     0.943780  0.471338  0.000000  0.168498   
2015-02-04 18:08:59     0.943780  0.475807  0.000000  0.168498   
2015-02-04 18:10:00     0.930622  0.475807  0.000000  0.170973 
"""

df = pd.read_csv(StringIO(data), sep="\s\s ")
df = df.set_index("Date")
df.index = pd.to_datetime(df.index)

print(df.groupby([df.index.hour]).mean())

      Temperature  Humidity     Light       CO2
Date                                           
17       0.986975  0.468231  0.273730  0.181391
18       0.960853  0.467565  0.172402  0.169187

CodePudding user response：

The tool you need is panasonic resample method.

To use it, Date column must be of datetime type, so if it is kept as a text, start from:

df.Date = pd.to_datetime(df.Date)

To compute the resampled DataFrame, run:

result = df.resample('30T', on="Date").mean()

In the above example 30T is the resample frequency - 30 mins. If you change your mind, set another value.

The result, for your source data sample, is:

                     Temperature  Humidity     Light       CO2
Date                                                          
2015-02-04 17:30:00     0.986975  0.468231  0.273730  0.181391
2015-02-04 18:00:00     0.960853  0.467565  0.172402  0.169187