I have weekly data covering several years, with the start date and end date in datetime format. I now want to create a new column for each year in the data, where the mean value of each month is calculated and stored for each hour of that year. All years should have the same format, so leap years should be ignored. To summarize, I have the following data:
input_data:
datetime | A | B | C | D | ... | Z |
---------------------|---|---|---|---| --- |---|
2015-01-01 00:00:00 |123| 23| 67|189| ... | 78|
................... |...|...|...|...| ... |...|
2021-06-01 00:00:00 |345| 87|456| 89| ... | 23|
where 2015-01-01 00:00:00 is the start date and 2021-06-01 08:00:00 is the end date. I would like to get something like:
output:
datetime | 2015 | 2016 | 2017| 2018 | ... | 2021 |
----------------|---------|---------|---------|-----------|-----|----------|
01-01 00:00:00 |mean(A:Z)| mean(A:Z)| mean(A:Z)|mean(A:Z)| ... | mean(A:Z)|
................|.........|..........|..........|.........| ... |..........|
12-31 23:00:00 |mean(A:Z)| mean(A:Z)|mean(A:Z)| mean(A:Z)| ... | mean(A:Z)|
where mean(A:Z) is the mean value for each month of the columns A to Z. I would like to avoid iterating over each hour of each year. How can I best achieve this? Sorry if the question is too simple, but I am currently stuck.
CodePudding user response:
IIUC, you can use:
# Update
out = (df.assign(datetime=df['datetime'].dt.strftime('%m-%d %H:%M:%S'),
                 year=df['datetime'].dt.year.values)
         .set_index(['datetime', 'year']).mean(axis=1)
         .unstack('year'))
print(out)
# Alternative
# out = (df.set_index('datetime').mean(axis=1).to_frame('mean')
#          .assign(datetime=df['datetime'].dt.strftime('%m-%d %H:%M:%S').values,
#                  year=df['datetime'].dt.year.values)
#          .pivot(index='datetime', columns='year', values='mean'))
# Output
year 2015 2016 2017
datetime
01-01 00:00:00 259.000000 420.000000 263.333333
01-01 01:00:00 263.000000 205.333333 169.000000
01-01 02:00:00 342.000000 268.000000 302.000000
01-01 03:00:00 63.000000 243.000000 220.000000
01-01 04:00:00 299.333333 282.666667 421.666667
... ... ... ...
12-31 19:00:00 82.666667 215.000000 84.333333
12-31 20:00:00 316.000000 367.000000 237.666667
12-31 21:00:00 319.666667 170.666667 275.666667
12-31 22:00:00 119.666667 263.666667 325.333333
12-31 23:00:00 252.666667 300.000000 94.666667
[8784 rows x 3 columns]
Setup:
import pandas as pd
import numpy as np
np.random.seed(2022)
dti = pd.date_range('2015-01-01', '2017-12-31 23:00:00', freq='H', name='datetime')
df = pd.DataFrame(np.random.randint(1, 500, (len(dti), 3)),
                  index=dti, columns=list('ABC')).reset_index()
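The output above has 8784 rows because the leap year 2016 contributes 02-29 labels, which are only populated for that year. Since the question asks for all years to share the same format, here is a minimal sketch that drops 29 February before reshaping. The sample data, the `label` column name, and the choice to drop that day (rather than, say, 31 December) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: hourly values for 2015-2017 (2016 is a leap year).
rng = np.random.default_rng(2022)
dti = pd.date_range("2015-01-01", "2017-12-31 23:00:00",
                    freq=pd.Timedelta(hours=1), name="datetime")
df = pd.DataFrame(rng.integers(1, 500, (len(dti), 3)),
                  index=dti, columns=list("ABC")).reset_index()

# Assumption: "ignoring the leap year" means dropping 29 February,
# so that every year keeps the same 365 * 24 = 8760 hourly rows.
is_feb29 = (df["datetime"].dt.month == 2) & (df["datetime"].dt.day == 29)
no_leap = df.loc[~is_feb29]

out = (
    no_leap.assign(
        label=no_leap["datetime"].dt.strftime("%m-%d %H:%M:%S"),  # year-free key
        year=no_leap["datetime"].dt.year,
    )
    .drop(columns="datetime")
    .set_index(["label", "year"])
    .mean(axis=1)        # row-wise mean over the value columns A..C
    .unstack("year")     # one column per year
)
print(out.shape)
```

With this filter every year column should be fully populated, so there are no NaN rows left over from the leap day.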
CodePudding user response:
I would start by creating a new column for the year in the original data frame
input_data['year'] = input_data['datetime'].dt.year
Then I would use the groupby method with a for loop to calculate the means as follows:
output = pd.DataFrame()
output['datetime'] = input_data['datetime']
for name, group in input_data.groupby('year'):
    # Row-wise mean over the value columns, renumbered from 0 so the years line up by position
    values = group.drop(columns=['year', 'datetime'])
    output[name] = values.mean(axis=1).reset_index(drop=True)
That being said, I am making an assumption here, based on your question, that leap years are to be ignored and that all years have the same format and number of samples. If you have any further questions, or if the years don't have the same number of samples, please tell me.
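If the years do not all have the same number of samples (for example, when the last year stops in June), lining the years up by position no longer works. One possible variation, sketched below under assumed sample data, keeps the per-year loop but indexes each year's means by a month-day-hour label and lets `pd.concat` align them. The column names `A`, `B`, `C` and the decision to drop 29 February are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 2015 is complete, 2016 stops at 1 June 08:00.
dti = pd.date_range("2015-01-01", "2016-06-01 08:00:00",
                    freq=pd.Timedelta(hours=1), name="datetime")
rng = np.random.default_rng(0)
input_data = pd.DataFrame(rng.integers(1, 500, (len(dti), 3)), columns=list("ABC"))
input_data.insert(0, "datetime", dti)

# Assumption: leap days are dropped so every year uses the same labels.
ts = input_data["datetime"]
input_data = input_data[~((ts.dt.month == 2) & (ts.dt.day == 29))]

value_cols = ["A", "B", "C"]
per_year = {}
for year, group in input_data.groupby(input_data["datetime"].dt.year):
    means = group[value_cols].mean(axis=1)  # row-wise mean for this year
    # Replace the positional index with a year-free month-day-hour label.
    means.index = pd.Index(group["datetime"].dt.strftime("%m-%d %H:%M:%S"),
                           name="datetime")
    per_year[year] = means

# Align on the label; hours missing from a year become NaN.
output = pd.concat(per_year, axis=1).sort_index()
print(output.tail())
```

The loop here runs once per year rather than once per hour, and the hours that a partial year does not cover simply show up as NaN in that year's column.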