How to group datetime in date hour and take mean of the rows using Python-CodePudding

Is it possible to select all the rows within the same date and hour of a datetime, then take the mean of each of the columns of all these rows and hereby create a new row with the minutes and seconds rounded to 00:00 and where the columns contains the mean values?

Our dataframe looks like the following:

datetime               name    temperature    humidity    wind speed
2020-02-01 00:04:12    air     8.2            87          0.4
2020-02-01 00:22:54    air     7.8            84          0.3
2020-02-01 00:47:32    air     7.5            84          0.4
2020-02-01 01:09:35    air     7.2            86          0.1
2020-02-01 01:58:02    air     6.9            80          0.0
2020-02-01 02:14:51    air     6.8            81          0.2
...

What we want and therefore what the dataframe should be changed to:

datetime               name    temperature    humidity    wind speed
2020-02-01 00:00:00    air     7.83           85          0.36
2020-02-01 01:00:00    air     7.05           83          0.05
2020-02-01 02:00:00    air     6.8            81          0.2
...

We imagine that a proper solution would be to use the groupby method, where we can group by date hour, but we can't find a proper way of getting the output described above

df = pd.read_csv("file.csv")
df.groupby({date   hour}).mean()

CodePudding user response：

df.groupby([df.datetime.astype(str).str[:13]   ":00:00", df.name]).mean().reset_index()

CodePudding user response：

First you'd need to have your datetime field as a DatetimeIndex so it can by split to intervals automatically, since relying on string manipulations is error-prone in cases your timestamp field is malformed or the format changes.

You can do this by running:

df['datetime'] = pd.to_datetime(df['datetime'])

Then groupby and resample your dataframe like this (H will do by hour):

df.groupby('name').resample('H', on='datetime').mean()

example:

data = {
    'datetime':['2020-02-01 00:04:12', '2020-02-01 00:22:54', '2020-02-01 01:22:54'],
    'temp':[10,20,30], 
    'other_metric':[100,200,300],
    'name':['air','air','air']
}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])
df.groupby('name').resample('H', on='datetime').mean()

will turn this:

             datetime  temp  other_metric name
2020-02-01 00:04:12    10           100  air
2020-02-01 00:22:54    20           200  air
2020-02-01 01:22:54    30           300  air

into that:

                          temp  other_metric
name datetime
air  2020-02-01 00:00:00  15.0         150.0
     2020-02-01 01:00:00  30.0         300.0