I am attempting to use Pandas to calculate hourly averages from one minute data for a fairly large dataset. My dataset is stored as a .csv, and looks like:
,timestamp_local,pm25
0,5/30/2022 16:59,2.2802
1,5/30/2022 16:58,2.7055
2,5/30/2022 16:57,1.6602
3,5/30/2022 16:56,1.5924
4,5/30/2022 16:55,2.2792
5,5/30/2022 16:54,2.2219
I attempted to read the .csv file as a Pandas DataFrame and then store the hourly means in a new DataFrame that I will export:
import pandas as pd
df = pd.read_csv('extracted_data.csv')
df['timestamp_local'] = pd.to_datetime(df['timestamp_local'])
df1 = df.groupby([pd.Grouper(freq='H', key='timestamp_local'), 'pm25']).mean()
print(df1)
After executing this script, I am left with the minute data grouped by it's respective hour:
timestamp_local pm25 unnamed: 0
2022-04-30 17:00:00 0.8766 42917.0
0.9124 42913.0
0.9647 42893.0
0.9842 42883.0
1.0974 42894.0
... ...
2022-05-30 16:00:00 3.0069 43.0
3.1727 40.0
3.2482 30.0
3.3628 42.0
4.5797 13.0
My goal is to eliminate the minute data altogether so I would just have the timestamp and the calculated hourly average, but I am unsure how to go about this. Any advice would be greatly appreciated.
CodePudding user response:
have you tried this? With a pandas pipeline you should be able to stack grouping and resampling methods.
df1 = df.groupby('pm25']).resample('H').mean()
CodePudding user response:
I was able to figure this out:
import pandas as pd
df = pd.read_csv('extracted_data.csv')
df.index = pd.to_datetime(df['timestamp_local'])
df1 = df.groupby([df.index.day, df.index.hour]).mean()
df1.to_excel('output.xlsx')
This gave me the data grouped by day & hour with the calculated hourly mean of the 'pm25' parameter:
timestamp_local timestamp_local pm25
1 0 4.917292
1 2.976207
2 15.244288
3 5.802312
4 6.185230
... ... ...
30 19 2.388808
20 1.300867
21 2.053445
22 7.007275
23 10.163283