I have a dataframe containing a time series with hourly data from 2015 to 2020. I want to build a new dataframe with one column holding the values of the series for each year (or for each month of each year) so I can analyse them separately. Since one of the years is a leap year, I want all columns to share an index but hold NaN at the 29 Feb positions in the non-leap years. I tried using merge after creating two new columns, month and day_of_month, but the index goes crazy: instead of the ~40,000 entries it should have, the result grows to millions of rows, ends up taking more than 20 GB of RAM, and crashes:
years = pd.DataFrame(index=pd.date_range('2016-01-01', '2017-01-01', freq='1H'))
years['month'] = years.index.month
years['day_of_month'] = years.index.day
gp = data_md[['value', 'month', 'day_of_month']].groupby(pd.Grouper(freq='1Y'))
for name, group in gp:
    years = years.merge(group, right_on=['month', 'day_of_month'], left_on=['month', 'day_of_month'])
RESULT:
month day_of_month value
0 1 1 0
1 1 1 6
2 1 1 2
3 1 1 0
4 1 1 1
... ... ... ...
210259 12 31 6
210260 12 31 2
210261 12 31 4
210262 12 31 5
210263 12 31 1
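I suspect the blow-up happens because the (month, day_of_month) keys repeat on both sides of each merge, so every merge is many-to-many and the row count multiplies on each loop iteration. A minimal reproduction of that effect (small made-up frames, not my real data):

```python
import pandas as pd

# Two frames where the join key repeats on both sides:
left = pd.DataFrame({'month': [1, 1], 'day_of_month': [1, 1], 'a': [10, 20]})
right = pd.DataFrame({'month': [1, 1], 'day_of_month': [1, 1], 'b': [30, 40]})

# Many-to-many merge: every matching pair combines, so 2 x 2 = 4 rows.
merged = left.merge(right, on=['month', 'day_of_month'])
print(len(merged))  # 4
```

With 24 repeats of each key per year, that growth compounds fast across the loop.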
How can I construct the frame so that it has one value column per year (or per month)?
Here is the original frame from which I want to build the new one; the only column needed for now is value:
value month day_of_month week day_name year hour season dailyp day_of_week ... hourly_no_noise daily_trend daily_seasonal daily_residuals daily_no_noise daily_trend_h daily_seasonal_h daily_residuals_h daily_no_noise_h Total
date
2015-01-01 00:00:00 0 1 1 1 Thursday 2015 0 Invierno 165.0 3 ... NaN NaN -9.053524 NaN NaN NaN -3.456929 NaN NaN 6436996.0
2015-01-01 01:00:00 6 1 1 1 Thursday 2015 1 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -4.879983 NaN NaN NaN
2015-01-01 02:00:00 2 1 1 1 Thursday 2015 2 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -5.895367 NaN NaN NaN
2015-01-01 03:00:00 0 1 1 1 Thursday 2015 3 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.468616 NaN NaN NaN
2015-01-01 04:00:00 1 1 1 1 Thursday 2015 4 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.441830 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-12-31 19:00:00 6 12 31 1 Tuesday 2019 19 Invierno NaN 1 ... 11.529465 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.805720 1.176491 17.823509 NaN
2019-12-31 20:00:00 3 12 31 1 Tuesday 2019 20 Invierno NaN 1 ... 11.314857 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.928751 1.176491 17.823509 NaN
2019-12-31 21:00:00 3 12 31 1 Tuesday 2019 21 Invierno NaN 1 ... 10.141139 230.571429 -4.997480 -11.299166 237.299166 9.613095 1.774848 1.176491 17.823509 NaN
2019-12-31 22:00:00 3 12 31 1 Tuesday 2019 22 Invierno NaN 1 ... 8.823152 230.571429 -4.997480 -11.299166 237.299166 9.613095 0.663344 1.176491 17.823509 NaN
2019-12-31 23:00:00 6 12 31 1 Tuesday 2019 23 Invierno NaN 1 ... 6.884636 230.571429 -4.997480 -11.299166 237.299166 9.613095 -1.624980 1.176491 17.823509 NaN
I would like to end up with a dataframe like this:
2015 2016 2017 2018 2019
2016-01-01 00:00:00 0.074053 0.218161 0.606810 0.687365 0.352672
2016-01-01 01:00:00 0.465167 0.210297 0.722825 0.683341 0.885175
2016-01-01 02:00:00 0.175964 0.610560 0.722479 0.016842 0.205916
2016-01-01 03:00:00 0.945955 0.807490 0.627525 0.187677 0.535116
2016-01-01 04:00:00 0.757608 0.797835 0.639215 0.455989 0.042285
... ... ... ... ... ...
2016-12-30 20:00:00 0.046138 0.139100 0.397547 0.738687 0.335306
2016-12-30 21:00:00 0.672800 0.802090 0.617625 0.787601 0.007535
2016-12-30 22:00:00 0.698141 0.776686 0.423712 0.667808 0.298338
2016-12-30 23:00:00 0.198089 0.642073 0.586527 0.106567 0.514569
2016-12-31 00:00:00 0.367572 0.390791 0.105193 0.592167 0.007365
where 29 Feb is NaN on non-leap years:
df['2016-02']
2015 2016 2017 2018 2019
2016-02-01 00:00:00 0.656703 0.348784 0.383639 0.208786 0.183642
2016-02-01 01:00:00 0.488729 0.909498 0.873642 0.122028 0.547563
2016-02-01 02:00:00 0.210427 0.912393 0.505873 0.085149 0.358841
2016-02-01 03:00:00 0.281107 0.534750 0.622473 0.643611 0.258437
2016-02-01 04:00:00 0.187434 0.327459 0.701008 0.887041 0.385816
... ... ... ... ... ...
2016-02-29 19:00:00 NaN 0.742402 NaN NaN NaN
2016-02-29 20:00:00 NaN 0.013419 NaN NaN NaN
2016-02-29 21:00:00 NaN 0.517194 NaN NaN NaN
2016-02-29 22:00:00 NaN 0.003136 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.128406 NaN NaN NaN
CodePudding user response:
So let's assume we have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(pd.date_range('2015-01-01', '2020-01-01', freq='1H'), columns = ['Date and Time'])
df['str'] = df['Date and Time'].dt.strftime('%Y-%m-%d')
df[['Year', 'Month', 'Day']] = df['str'].str.split('-', expand=True)
df['Values'] = np.random.rand(len(df))
print(df)
Output:
Date and Time str Year Month Day Values
0 2015-01-01 00:00:00 2015-01-01 2015 01 01 0.806333
1 2015-01-01 01:00:00 2015-01-01 2015 01 01 0.696777
2 2015-01-01 02:00:00 2015-01-01 2015 01 01 0.044131
3 2015-01-01 03:00:00 2015-01-01 2015 01 01 0.027459
4 2015-01-01 04:00:00 2015-01-01 2015 01 01 0.868010
... ... ... ... ... .. ...
43820 2019-12-31 20:00:00 2019-12-31 2019 12 31 0.823029
43821 2019-12-31 21:00:00 2019-12-31 2019 12 31 0.693383
43822 2019-12-31 22:00:00 2019-12-31 2019 12 31 0.853509
43823 2019-12-31 23:00:00 2019-12-31 2019 12 31 0.727964
43824 2020-01-01 00:00:00 2020-01-01 2020 01 01 0.414747
[43825 rows x 6 columns]
Now we split the dataframe by year and store each part in a dictionary:
d = {}
for i in range(2015,2020):
    d[i] = pd.DataFrame(df[df['Year'] == str(i)])
    d[i].sort_values(by='Date and Time', inplace=True, ignore_index=True)
print(d[2015][d[2015]['Month'] == '02'])
print('-----------------------------------------------------------')
print(d[2016][d[2016]['Month'] == '02'])
print('-----------------------------------------------------------')
print(d[2017][d[2017]['Month'] == '02'])
print('-----------------------------------------------------------')
Output:
Date and Time str Year Month Day Values
744 2015-02-01 00:00:00 2015-02-01 2015 02 01 0.427832
745 2015-02-01 01:00:00 2015-02-01 2015 02 01 0.710451
... ... ... ... ... .. ...
1414 2015-02-28 22:00:00 2015-02-28 2015 02 28 0.016734
1415 2015-02-28 23:00:00 2015-02-28 2015 02 28 0.048966
[672 rows x 6 columns]
-----------------------------------------------------------
Date and Time str Year Month Day Values
744 2016-02-01 00:00:00 2016-02-01 2016 02 01 0.431218
745 2016-02-01 01:00:00 2016-02-01 2016 02 01 0.462041
... ... ... ... ... .. ...
1438 2016-02-29 22:00:00 2016-02-29 2016 02 29 0.493304
1439 2016-02-29 23:00:00 2016-02-29 2016 02 29 0.622349
[696 rows x 6 columns]
-----------------------------------------------------------
Date and Time str Year Month Day Values
744 2017-02-01 00:00:00 2017-02-01 2017 02 01 0.071900
745 2017-02-01 01:00:00 2017-02-01 2017 02 01 0.665318
... ... ... ... ... .. ...
1414 2017-02-28 22:00:00 2017-02-28 2017 02 28 0.463702
1415 2017-02-28 23:00:00 2017-02-28 2017 02 28 0.306710
[672 rows x 6 columns]
-----------------------------------------------------------
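As an aside, the same per-year split can be written more compactly with groupby instead of a manual loop; a sketch under the same setup (random values, so only the row counts are meaningful):

```python
import pandas as pd
import numpy as np

# Same hourly range as above, reduced to the relevant columns.
df = pd.DataFrame(pd.date_range('2015-01-01', '2020-01-01', freq='1H'),
                  columns=['Date and Time'])
df['Values'] = np.random.rand(len(df))

# One sub-frame per year, keyed by the year number.
d = {year: g.reset_index(drop=True)
     for year, g in df.groupby(df['Date and Time'].dt.year)}

print(len(d[2016]))  # 8784 hourly rows in the leap year
print(len(d[2015]))  # 8760 in a regular year
```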
To fix the leap year problem:
There is definitely a better way, but the simplest thing I can think of is to create the NaN rows where 29 Feb would fall, insert them, and then concatenate the pieces.
# Hours 1416-1439 are where 29 Feb would sit: (31 + 28) days * 24 hours = 1416 hours in.
indexes = list(range(1416, 1440))
lines = pd.DataFrame(np.nan, columns=df.columns.values, index=indexes)
print(lines.head())
Output:
Date and Time str Year Month Day Values
1416 NaN NaN NaN NaN NaN NaN
1417 NaN NaN NaN NaN NaN NaN
1418 NaN NaN NaN NaN NaN NaN
1419 NaN NaN NaN NaN NaN NaN
1420 NaN NaN NaN NaN NaN NaN
Then I add the NaN rows to the data frame with the following code:
b = {}
for i in range(2015,2020):
    if list(d[i][d[i]['Month'] == '02'].tail(1)['Day'])[0] == '28':
        bi = pd.concat([d[i].iloc[0:1416], lines]).reset_index(drop=True)
        b[i] = pd.concat([bi, d[i].iloc[1416:8783]]).reset_index(drop=True)
    else:
        b[i] = d[i].copy()
print(b[2017].iloc[1414:1441])
Output:
Date and Time str Year Month Day Values
1414 2017-02-28 22:00:00 2017-02-28 2017 02 28 0.463702
1415 2017-02-28 23:00:00 2017-02-28 2017 02 28 0.306710
1416 NaT NaN NaN NaN NaN NaN
1417 NaT NaN NaN NaN NaN NaN
1418 NaT NaN NaN NaN NaN NaN
1419 NaT NaN NaN NaN NaN NaN
1420 NaT NaN NaN NaN NaN NaN
1421 NaT NaN NaN NaN NaN NaN
1422 NaT NaN NaN NaN NaN NaN
1423 NaT NaN NaN NaN NaN NaN
1424 NaT NaN NaN NaN NaN NaN
1425 NaT NaN NaN NaN NaN NaN
1426 NaT NaN NaN NaN NaN NaN
1427 NaT NaN NaN NaN NaN NaN
1428 NaT NaN NaN NaN NaN NaN
1429 NaT NaN NaN NaN NaN NaN
1430 NaT NaN NaN NaN NaN NaN
1431 NaT NaN NaN NaN NaN NaN
1432 NaT NaN NaN NaN NaN NaN
1433 NaT NaN NaN NaN NaN NaN
1434 NaT NaN NaN NaN NaN NaN
1435 NaT NaN NaN NaN NaN NaN
1436 NaT NaN NaN NaN NaN NaN
1437 NaT NaN NaN NaN NaN NaN
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2017-03-01 00:00:00 2017-03-01 2017 03 01 0.474829
for i in range(2015,2020):
    print(len(b[i]))
    print('----------------------')
Output:
8784
----------------------
8784
----------------------
8784
----------------------
8784
----------------------
8784
----------------------
And finally, to build the dataframe you asked for:
final_df = pd.DataFrame(index = b[2016]['Date and Time'])
final_df[2015] = np.array(b[2015]['Values'])
final_df[2016] = np.array(b[2016]['Values'])
final_df[2017] = np.array(b[2017]['Values'])
final_df[2018] = np.array(b[2018]['Values'])
final_df[2019] = np.array(b[2019]['Values'])
print(final_df)
Output:
2015 2016 2017 2018 2019
Date and Time
2016-01-01 00:00:00 0.681452 0.893695 0.107281 0.055448 0.524573
2016-01-01 01:00:00 0.860485 0.036340 0.036067 0.540155 0.367901
2016-01-01 02:00:00 0.159925 0.774041 0.272395 0.570306 0.443723
2016-01-01 03:00:00 0.149609 0.271355 0.044498 0.760091 0.082329
2016-01-01 04:00:00 0.836742 0.344297 0.202729 0.319588 0.979587
... ... ... ... ... ...
2016-12-31 19:00:00 0.602483 0.514097 0.539197 0.213343 0.915307
2016-12-31 20:00:00 0.411157 0.760006 0.622840 0.067084 0.945960
2016-12-31 21:00:00 0.212591 0.324712 0.097114 0.333916 0.267258
2016-12-31 22:00:00 0.845888 0.937634 0.295480 0.564552 0.429678
2016-12-31 23:00:00 0.502465 0.582802 0.293368 0.219390 0.125926
[8784 rows x 5 columns]
Slicing final_df around the end of February shows the NaN values for the non-leap years:
2015 2016 2017 2018 2019
Date and Time
2016-02-28 22:00:00 0.005703 0.758915 0.920366 0.894347 0.916487
2016-02-28 23:00:00 0.450760 0.481793 0.804420 0.555427 0.237902
2016-02-29 00:00:00 NaN 0.966666 NaN NaN NaN
2016-02-29 01:00:00 NaN 0.579951 NaN NaN NaN
CodePudding user response:
IIUC, you just need the original DataFrame:
origin = 2016  # or whatever year of your choosing
newidx = df.index.strftime(f'{origin}-%m-%d %H:%M:%S')
newdf = (
df[['value']]
.assign(year=df.index.year)
.set_axis(newidx, axis=0)
.pivot(columns='year', values='value')
)
Using the small sample data you provided for that "original frame" df, we get:
>>> newdf
year 2015 2019
date
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 6.0 NaN
2016-01-01 02:00:00 2.0 NaN
... ... ...
2016-12-31 21:00:00 NaN 3.0
2016-12-31 22:00:00 NaN 3.0
2016-12-31 23:00:00 NaN 6.0
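One caveat worth noting: pivot raises a ValueError if any (remapped timestamp, year) pair occurs twice, e.g. duplicate readings for the same hour. If that can happen in your data, pivot_table with an explicit aggregation is a tolerant alternative; a sketch on a tiny made-up frame (not your data) with a duplicated hour:

```python
import pandas as pd

# Tiny frame with a duplicated hour in 2015:
ix = pd.to_datetime(['2015-03-01 00:00', '2015-03-01 00:00',
                     '2016-03-01 00:00'])
df = pd.DataFrame({'value': [1.0, 3.0, 5.0]}, index=ix)

origin = 2016
newidx = df.index.strftime(f'{origin}-%m-%d %H:%M:%S')

# pivot would raise ValueError on the duplicate (timestamp, year) pair;
# pivot_table aggregates the duplicates instead (mean here).
newdf = (
    df.assign(year=df.index.year, ts=newidx)
      .pivot_table(index='ts', columns='year', values='value', aggfunc='mean')
)
print(newdf)  # 2015 column holds mean(1.0, 3.0) = 2.0, 2016 holds 5.0
```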
On a larger (made-up) DataFrame:
np.random.seed(0)
ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)
# (code above)
>>> newdf
year 2015 2016 2017 2018 2019
2016-01-01 00:00:00 44.0 82.0 96.0 68.0 71.0
2016-01-01 01:00:00 47.0 99.0 54.0 44.0 71.0
2016-01-01 02:00:00 64.0 28.0 11.0 10.0 55.0
... ... ... ... ... ...
2016-12-31 21:00:00 0.0 30.0 28.0 53.0 14.0
2016-12-31 22:00:00 47.0 82.0 19.0 6.0 64.0
2016-12-31 23:00:00 22.0 75.0 13.0 37.0 35.0
and, as expected, only 2016 has values for 02/29:
>>> newdf[:'2016-02-29 02:00:00'].tail()
year 2015 2016 2017 2018 2019
2016-02-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-02-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-02-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-02-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-02-29 02:00:00 NaN 67.0 NaN NaN NaN
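Since the question also mentions a per-month-of-each-year analysis: if the remapped index is converted to a real DatetimeIndex, one month across all years is just a partial-string slice of the pivoted frame. A sketch building on the same made-up df as above:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)

origin = 2016
# to_datetime gives a DatetimeIndex, enabling partial-string slicing later.
newidx = pd.to_datetime(df.index.strftime(f'{origin}-%m-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(year=df.index.year)
    .set_axis(newidx, axis=0)
    .pivot(columns='year', values='value')
)

# One month of every year in a single slice:
feb = newdf.loc['2016-02']
print(feb.shape)  # (696, 5): 29 days x 24 hours, one column per year
```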