How to calculate the mean value of a month but store it hourly in pandas?

I have weekly data for several years, with the start date and end date in datetime format. I now want to make a new column for each year of data, where the mean value of each month is calculated and stored for each hour of that year. All years should have the same format, so leap years should be ignored. To summarize, I have the following data:

input_data:

datetime             | A | B | C | D | ... | Z |
---------------------|---|---|---|---| --- |---|
2015-01-01 00:00:00  |123| 23| 67|189| ... | 78|
...................  |...|...|...|...| ... |...|
2021-06-01 00:00:00  |345| 87|456| 89| ... | 23|

where I have 2015-01-01 00:00:00 as the start date and 2021-06-01 08:00:00 as the end date. I would like to get something like:

output:

datetime        | 2015      | 2016      | 2017      | 2018      | ... | 2021      |
----------------|-----------|-----------|-----------|-----------|-----|-----------|
01-01 00:00:00  | mean(A:Z) | mean(A:Z) | mean(A:Z) | mean(A:Z) | ... | mean(A:Z) |
............... | ...       | ...       | ...       | ...       | ... | ...       |
12-31 23:00:00  | mean(A:Z) | mean(A:Z) | mean(A:Z) | mean(A:Z) | ... | mean(A:Z) |

where mean(A:Z) is the mean value for each month of the columns A to Z. I would like to avoid iterating over each hour of each year. How can I best achieve this? Sorry if the question is too simple, but I am currently stuck.

CodePudding user response：

IIUC, you can use:

# Update
out = (df.assign(datetime=df['datetime'].dt.strftime('%m-%d %H:%M:%S'),
                 year=df['datetime'].dt.year.values)
         .set_index(['datetime', 'year']).mean(axis=1)
         .unstack('year'))
print(out)

# Alternative
# out = (df.set_index('datetime').mean(axis=1).to_frame('mean')
#          .assign(datetime=df['datetime'].dt.strftime('%m-%d %H:%M:%S').values, 
#                  year=df['datetime'].dt.year.values)
#          .pivot(index='datetime', columns='year', values='mean'))

# Output
year                  2015        2016        2017
datetime                                          
01-01 00:00:00  259.000000  420.000000  263.333333
01-01 01:00:00  263.000000  205.333333  169.000000
01-01 02:00:00  342.000000  268.000000  302.000000
01-01 03:00:00   63.000000  243.000000  220.000000
01-01 04:00:00  299.333333  282.666667  421.666667
...                    ...         ...         ...
12-31 19:00:00   82.666667  215.000000   84.333333
12-31 20:00:00  316.000000  367.000000  237.666667
12-31 21:00:00  319.666667  170.666667  275.666667
12-31 22:00:00  119.666667  263.666667  325.333333
12-31 23:00:00  252.666667  300.000000   94.666667

[8784 rows x 3 columns]

Setup:

import pandas as pd
import numpy as np

np.random.seed(2022)
dti = pd.date_range('2015-01-01', '2017-12-31 23:00:00', freq='H', name='datetime')
df = pd.DataFrame(np.random.randint(1, 500, (len(dti), 3)),
                  index=dti, columns=list('ABC')).reset_index()
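
Note that the snippet above takes the mean across the columns for every hourly row. If the goal is instead the monthly mean repeated on every hour of that month (one possible reading of the question), a minimal sketch could use `groupby(...).transform('mean')`. It assumes a `datetime` column and numeric value columns as in the setup; the names `row_mean`, `monthly`, `key` and `out_monthly` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data shaped like the setup above (assumption: columns A..C)
rng = np.random.default_rng(2022)
dti = pd.date_range('2015-01-01', '2017-12-31 23:00:00', freq='h', name='datetime')
df = pd.DataFrame(rng.integers(1, 500, (len(dti), 3)),
                  index=dti, columns=list('ABC')).reset_index()

# 1) Mean across the value columns for every hourly row
row_mean = df.drop(columns='datetime').mean(axis=1)

# 2) Mean of those values per (year, month), broadcast back onto every hour
year = df['datetime'].dt.year.rename('year')
month = df['datetime'].dt.month.rename('month')
monthly = row_mean.groupby([year, month]).transform('mean')

# 3) One column per year, indexed by a "month-day hour" label
key = df['datetime'].dt.strftime('%m-%d %H:%M:%S')
out_monthly = (pd.DataFrame({'key': key, 'year': year, 'mean': monthly})
                 .pivot(index='key', columns='year', values='mean'))
print(out_monthly)
```

With this layout every hour in January 2015 holds the same value, namely the January 2015 mean. Rows for 29 February are NaN in the non-leap years.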

CodePudding user response：

I would start by creating a new column for the year in the original data frame:

input_data['year'] = input_data['datetime'].dt.year

Then I would use the groupby method with a for loop to calculate the means, as follows:

output = pd.DataFrame()
output['datetime'] = input_data['datetime']

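# Group by the scalar key so that `name` is the year itself (e.g. 2015)
for name, group in input_data.groupby('year'):
    # Row-wise mean over the value columns only; drop returns a copy
    values = group.drop(columns=['year', 'datetime'])
    output[name] = values.mean(axis=1).reset_index(drop=True)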

That being said, I am assuming, based on your question, that leap years are to be ignored and that all years have the same format and number of samples. If you have any further questions, or if the years do not have the same number of samples, please tell me.
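
If the years do not all have the same number of samples (for example because of a leap year), aligning the per-year means by row position, as in the loop above, would shift every row after 29 February. One possible way around this, sketched below under the assumption of hourly rows with a `datetime` column, is to drop 29 February and align on a month-day-hour label instead. The data and the names `is_leap_day`, `hourly`, `label` and `means` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: hourly rows with a 'datetime' column and value columns A..C
rng = np.random.default_rng(0)
dti = pd.date_range('2015-01-01', '2016-12-31 23:00:00', freq='h')
input_data = pd.DataFrame(rng.integers(1, 500, (len(dti), 3)), columns=list('ABC'))
input_data.insert(0, 'datetime', dti)

# Drop 29 February so that every full year has 365 * 24 = 8760 rows
dt = input_data['datetime'].dt
is_leap_day = (dt.month == 2) & (dt.day == 29)
hourly = input_data.loc[~is_leap_day]

# Row-wise mean over the value columns
means = hourly.drop(columns='datetime').mean(axis=1)
label = hourly['datetime'].dt.strftime('%m-%d %H:%M:%S')
years = hourly['datetime'].dt.year

# Align on the label rather than on row position: one column per year
output = (pd.DataFrame({'datetime': label, 'year': years, 'mean': means})
            .pivot(index='datetime', columns='year', values='mean'))
print(output)
```

Because the rows are matched on the label, a year with missing hours would simply show NaN in those positions instead of pushing later rows out of place.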