So i'm currently working with a dataset so I can train a model to predict occupancy. I want to break down the dataset by hour, so aggregate the data every 60 minutes to get the mean value per hour or half an hour so its easier to read when plotting. How do I do that?
Date Temperature Humidity Light CO2
1 2015-02-04 17:51:00 1.000000 0.470533 0.275490 0.190933
2 2015-02-04 17:51:59 0.992823 0.470332 0.277754 0.186446
3 2015-02-04 17:53:00 0.992823 0.469326 0.275490 0.186136
4 2015-02-04 17:54:00 0.992823 0.467315 0.275490 0.182887
5 2015-02-04 17:55:00 0.980861 0.467315 0.275490 0.180566
6 2015-02-04 17:55:59 0.980861 0.467315 0.270964 0.178400
7 2015-02-04 17:57:00 0.980861 0.467315 0.270964 0.178813
8 2015-02-04 17:57:59 0.980861 0.467315 0.270964 0.177162
9 2015-02-04 17:58:59 0.980861 0.467315 0.270964 0.171180
10 2015-02-04 18:00:00 0.974880 0.466197 0.270964 0.170354
11 2015-02-04 18:01:00 0.974880 0.465080 0.270964 0.171747
12 2015-02-04 18:02:00 0.980861 0.462845 0.270964 0.172211
13 2015-02-04 18:03:00 0.980861 0.465825 0.270964 0.167569
14 2015-02-04 18:04:00 0.968900 0.465080 0.270964 0.170045
15 2015-02-04 18:04:59 0.956938 0.463962 0.270964 0.169117
16 2015-02-04 18:06:00 0.956938 0.463962 0.270640 0.165713
17 2015-02-04 18:07:00 0.956938 0.467315 0.000000 0.166331
18 2015-02-04 18:08:00 0.943780 0.471338 0.000000 0.168498
19 2015-02-04 18:08:59 0.943780 0.475807 0.000000 0.168498
20 2015-02-04 18:10:00 0.930622 0.475807 0.000000 0.170973
I'm sure this has probably been answered before but I couldn't find anything... i'm looking to plot this data.
CodePudding user response:
Essentially you'll convert your Date
column to datetime-objects using pd.to_datetime
and then simply df.groupby
after setting the Date
column as index using
df.groupby([df.index.hour]).mean()
Here's a full example:
import pandas as pd
from io import StringIO
data = """Date Temperature Humidity Light CO2
2015-02-04 17:51:00 1.000000 0.470533 0.275490 0.190933
2015-02-04 17:51:59 0.992823 0.470332 0.277754 0.186446
2015-02-04 17:53:00 0.992823 0.469326 0.275490 0.186136
2015-02-04 17:54:00 0.992823 0.467315 0.275490 0.182887
2015-02-04 17:55:00 0.980861 0.467315 0.275490 0.180566
2015-02-04 17:55:59 0.980861 0.467315 0.270964 0.178400
2015-02-04 17:57:00 0.980861 0.467315 0.270964 0.178813
2015-02-04 17:57:59 0.980861 0.467315 0.270964 0.177162
2015-02-04 17:58:59 0.980861 0.467315 0.270964 0.171180
2015-02-04 18:00:00 0.974880 0.466197 0.270964 0.170354
2015-02-04 18:01:00 0.974880 0.465080 0.270964 0.171747
2015-02-04 18:02:00 0.980861 0.462845 0.270964 0.172211
2015-02-04 18:03:00 0.980861 0.465825 0.270964 0.167569
2015-02-04 18:04:00 0.968900 0.465080 0.270964 0.170045
2015-02-04 18:04:59 0.956938 0.463962 0.270964 0.169117
2015-02-04 18:06:00 0.956938 0.463962 0.270640 0.165713
2015-02-04 18:07:00 0.956938 0.467315 0.000000 0.166331
2015-02-04 18:08:00 0.943780 0.471338 0.000000 0.168498
2015-02-04 18:08:59 0.943780 0.475807 0.000000 0.168498
2015-02-04 18:10:00 0.930622 0.475807 0.000000 0.170973
"""
df = pd.read_csv(StringIO(data), sep="\s\s ")
df = df.set_index("Date")
df.index = pd.to_datetime(df.index)
print(df.groupby([df.index.hour]).mean())
Temperature Humidity Light CO2
Date
17 0.986975 0.468231 0.273730 0.181391
18 0.960853 0.467565 0.172402 0.169187
CodePudding user response:
The tool you need is panasonic resample method.
To use it, Date column must be of datetime type, so if it is kept as a text, start from:
df.Date = pd.to_datetime(df.Date)
To compute the resampled DataFrame, run:
result = df.resample('30T', on="Date").mean()
In the above example 30T is the resample frequency - 30 mins. If you change your mind, set another value.
The result, for your source data sample, is:
Temperature Humidity Light CO2
Date
2015-02-04 17:30:00 0.986975 0.468231 0.273730 0.181391
2015-02-04 18:00:00 0.960853 0.467565 0.172402 0.169187