I have a dataframe containing a time series with hourly data from 2015 to 2020. I want to build a new dataframe with one column holding the values of the series for each year (or for each month of each year) so I can analyse them separately. Since one of the years is a leap year, I want all columns to share an index but hold NaN at the 29 Feb positions in the non-leap years. I tried using merge after creating two new columns, month and day_of_month, but the index goes crazy: instead of the ~40,000 entries it should have, the result grows to millions of rows, ends up taking more than 20 GB of RAM, and crashes:
years = pd.DataFrame(index=pd.date_range('2016-01-01', '2017-01-01', freq='1H'))
years['month'] = years.index.month
years['day_of_month'] = years.index.day
gp = data_md[['value', 'month', 'day_of_month']].groupby(pd.Grouper(freq='1Y'))
for name, group in gp:
    years = years.merge(group, right_on=['month', 'day_of_month'], left_on=['month', 'day_of_month'])
RESULT:
month day_of_month value
0 1 1 0
1 1 1 6
2 1 1 2
3 1 1 0
4 1 1 1
... ... ... ...
210259 12 31 6
210260 12 31 2
210261 12 31 4
210262 12 31 5
210263 12 31 1
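I suspect the blow-up happens because the (month, day_of_month) keys repeat on both sides of each merge, so every merge is many-to-many and the row count multiplies on each loop iteration. A minimal reproduction of that effect (small made-up frames, not my real data):

```python
import pandas as pd

# Two frames where the join key repeats on both sides:
left = pd.DataFrame({'month': [1, 1], 'day_of_month': [1, 1], 'a': [10, 20]})
right = pd.DataFrame({'month': [1, 1], 'day_of_month': [1, 1], 'b': [30, 40]})

# Many-to-many merge: every matching pair combines, so 2 x 2 = 4 rows.
merged = left.merge(right, on=['month', 'day_of_month'])
print(len(merged))  # 4
```

With 24 repeats of each key per year, that growth compounds fast across the loop.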
How can I construct the frame so that it has one value column per year (or per month)?
Here is the original frame from which I want to build the new one; the only column needed for now is value:
value month day_of_month week day_name year hour season dailyp day_of_week ... hourly_no_noise daily_trend daily_seasonal daily_residuals daily_no_noise daily_trend_h daily_seasonal_h daily_residuals_h daily_no_noise_h Total
date
2015-01-01 00:00:00 0 1 1 1 Thursday 2015 0 Invierno 165.0 3 ... NaN NaN -9.053524 NaN NaN NaN -3.456929 NaN NaN 6436996.0
2015-01-01 01:00:00 6 1 1 1 Thursday 2015 1 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -4.879983 NaN NaN NaN
2015-01-01 02:00:00 2 1 1 1 Thursday 2015 2 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -5.895367 NaN NaN NaN
2015-01-01 03:00:00 0 1 1 1 Thursday 2015 3 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.468616 NaN NaN NaN
2015-01-01 04:00:00 1 1 1 1 Thursday 2015 4 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.441830 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-12-31 19:00:00 6 12 31 1 Tuesday 2019 19 Invierno NaN 1 ... 11.529465 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.805720 1.176491 17.823509 NaN
2019-12-31 20:00:00 3 12 31 1 Tuesday 2019 20 Invierno NaN 1 ... 11.314857 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.928751 1.176491 17.823509 NaN
2019-12-31 21:00:00 3 12 31 1 Tuesday 2019 21 Invierno NaN 1 ... 10.141139 230.571429 -4.997480 -11.299166 237.299166 9.613095 1.774848 1.176491 17.823509 NaN
2019-12-31 22:00:00 3 12 31 1 Tuesday 2019 22 Invierno NaN 1 ... 8.823152 230.571429 -4.997480 -11.299166 237.299166 9.613095 0.663344 1.176491 17.823509 NaN
2019-12-31 23:00:00 6 12 31 1 Tuesday 2019 23 Invierno NaN 1 ... 6.884636 230.571429 -4.997480 -11.299166 237.299166 9.613095 -1.624980 1.176491 17.823509 NaN
I would like to end up with a dataframe like this:
2015 2016 2017 2018 2019
2016-01-01 00:00:00 0.074053 0.218161 0.606810 0.687365 0.352672
2016-01-01 01:00:00 0.465167 0.210297 0.722825 0.683341 0.885175
2016-01-01 02:00:00 0.175964 0.610560 0.722479 0.016842 0.205916
2016-01-01 03:00:00 0.945955 0.807490 0.627525 0.187677 0.535116
2016-01-01 04:00:00 0.757608 0.797835 0.639215 0.455989 0.042285
... ... ... ... ... ...
2016-12-30 20:00:00 0.046138 0.139100 0.397547 0.738687 0.335306
2016-12-30 21:00:00 0.672800 0.802090 0.617625 0.787601 0.007535
2016-12-30 22:00:00 0.698141 0.776686 0.423712 0.667808 0.298338
2016-12-30 23:00:00 0.198089 0.642073 0.586527 0.106567 0.514569
2016-12-31 00:00:00 0.367572 0.390791 0.105193 0.592167 0.007365
where 29 Feb is NaN on non-leap years:
df['2016-02']
2015 2016 2017 2018 2019
2016-02-01 00:00:00 0.656703 0.348784 0.383639 0.208786 0.183642
2016-02-01 01:00:00 0.488729 0.909498 0.873642 0.122028 0.547563
2016-02-01 02:00:00 0.210427 0.912393 0.505873 0.085149 0.358841
2016-02-01 03:00:00 0.281107 0.534750 0.622473 0.643611 0.258437
2016-02-01 04:00:00 0.187434 0.327459 0.701008 0.887041 0.385816
... ... ... ... ... ...
2016-02-29 19:00:00 NaN 0.742402 NaN NaN NaN
2016-02-29 20:00:00 NaN 0.013419 NaN NaN NaN
2016-02-29 21:00:00 NaN 0.517194 NaN NaN NaN
2016-02-29 22:00:00 NaN 0.003136 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.128406 NaN NaN NaN
CodePudding user response:
So let's assume we have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(pd.date_range('2015-01-01', '2020-01-01', freq='1H'), columns = ['Date and Time'])
df['str'] = df['Date and Time'].dt.strftime('%Y-%m-%d')
df[['Year', 'Month', 'Day']] = df['str'].str.split('-', expand=True)
df['Values'] = np.random.rand(len(df))
print(df)
Output:
Date and Time str Year Month Day Values
0 2015-01-01 00:00:00 2015-01-01 2015 01 01 0.806333
1 2015-01-01 01:00:00 2015-01-01 2015 01 01 0.696777
2 2015-01-01 02:00:00 2015-01-01 2015 01 01 0.044131
3 2015-01-01 03:00:00 2015-01-01 2015 01 01 0.027459
4 2015-01-01 04:00:00 2015-01-01 2015 01 01 0.868010
... ... ... ... ... .. ...
43820 2019-12-31 20:00:00 2019-12-31 2019 12 31 0.823029
43821 2019-12-31 21:00:00 2019-12-31 2019 12 31 0.693383
43822 2019-12-31 22:00:00 2019-12-31 2019 12 31 0.853509
43823 2019-12-31 23:00:00 2019-12-31 2019 12 31 0.727964
43824 2020-01-01 00:00:00 2020-01-01 2020 01 01 0.414747
[43825 rows x 6 columns]
Now we split the dataframe by year and store each part in a dictionary:
d = {}
for i in range(2015,2020):
    d[i] = pd.DataFrame(df[df['Year'] == str(i)])
    d[i].sort_values(by='Date and Time', inplace=True, ignore_index=True)
print(d[2015][d[2015]['Month'] == '02'])
print('-----------------------------------------------------------')
print(d[2016][d[2016]['Month'] == '02'])
print('-----------------------------------------------------------')
print(d[2017][d[2017]['Month'] == '02'])
print('-----------------------------------------------------------')
Output:
Date and Time str Year Month Day Values
744 2015-02-01 00:00:00 2015-02-01 2015 02 01 0.427832
745 2015-02-01 01:00:00 2015-02-01 2015 02 01 0.710451
... ... ... ... ... .. ...
1414 2015-02-28 22:00:00 2015-02-28 2015 02 28 0.016734
1415 2015-02-28 23:00:00 2015-02-28 2015 02 28 0.048966
[672 rows x 6 columns]
-----------------------------------------------------------
Date and Time str Year Month Day Values
744 2016-02-01 00:00:00 2016-02-01 2016 02 01 0.431218
745 2016-02-01 01:00:00 2016-02-01 2016 02 01 0.462041
... ... ... ... ... .. ...
1438 2016-02-29 22:00:00 2016-02-29 2016 02 29 0.493304
1439 2016-02-29 23:00:00 2016-02-29 2016 02 29 0.622349
[696 rows x 6 columns]
-----------------------------------------------------------
Date and Time str Year Month Day Values
744 2017-02-01 00:00:00 2017-02-01 2017 02 01 0.071900
745 2017-02-01 01:00:00 2017-02-01 2017 02 01 0.665318
... ... ... ... ... .. ...
1414 2017-02-28 22:00:00 2017-02-28 2017 02 28 0.463702
1415 2017-02-28 23:00:00 2017-02-28 2017 02 28 0.306710
[672 rows x 6 columns]
-----------------------------------------------------------
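As an aside, the same per-year split can be written more compactly with groupby instead of a manual loop; a sketch under the same setup (random values, so only the row counts are meaningful):

```python
import pandas as pd
import numpy as np

# Same hourly range as above, reduced to the relevant columns.
df = pd.DataFrame(pd.date_range('2015-01-01', '2020-01-01', freq='1H'),
                  columns=['Date and Time'])
df['Values'] = np.random.rand(len(df))

# One sub-frame per year, keyed by the year number.
d = {year: g.reset_index(drop=True)
     for year, g in df.groupby(df['Date and Time'].dt.year)}

print(len(d[2016]))  # 8784 hourly rows in the leap year
print(len(d[2015]))  # 8760 in a regular year
```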
To fix the leap year problem:
There is definitely a better way, but the simplest thing I can think of is to create the NaN rows where 29 Feb would fall, insert them, and then concatenate the pieces.
# Hours 1416-1439 are where 29 Feb would sit: (31 + 28) days * 24 hours = 1416 hours in.
indexes = list(range(1416, 1440))
lines = pd.DataFrame(np.nan, columns=df.columns.values, index=indexes)
print(lines.head())
Output:
Date and Time str Year Month Day Values
1416 NaN NaN NaN NaN NaN NaN
1417 NaN NaN NaN NaN NaN NaN
1418 NaN NaN NaN NaN NaN NaN
1419 NaN NaN NaN NaN NaN NaN
1420 NaN NaN NaN NaN NaN NaN
Then I add the NaN rows to the data frame with the following code:
b = {}
for i in range(2015,2020):
    if list(d[i][d[i]['Month'] == '02'].tail(1)['Day'])[0] == '28':
        bi = pd.concat([d[i].iloc[0:1416], lines]).reset_index(drop=True)
        b[i] = pd.concat([bi, d[i].iloc[1416:8783]]).reset_index(drop=True)
    else:
        b[i] = d[i].copy()
print(b[2017].iloc[1414:1441])
Output:
Date and Time str Year Month Day Values
1414 2017-02-28 22:00:00 2017-02-28 2017 02 28 0.463702
1415 2017-02-28 23:00:00 2017-02-28 2017 02 28 0.306710
1416 NaT NaN NaN NaN NaN NaN
1417 NaT NaN NaN NaN NaN NaN
1418 NaT NaN NaN NaN NaN NaN
1419 NaT NaN NaN NaN NaN NaN
1420 NaT NaN NaN NaN NaN NaN
1421 NaT NaN NaN NaN NaN NaN
1422 NaT NaN NaN NaN NaN NaN
1423 NaT NaN NaN NaN NaN NaN
1424 NaT NaN NaN NaN NaN NaN
1425 NaT NaN NaN NaN NaN NaN
1426 NaT NaN NaN NaN NaN NaN
1427 NaT NaN NaN NaN NaN NaN
1428 NaT NaN NaN NaN NaN NaN
1429 NaT NaN NaN NaN NaN NaN
1430 NaT NaN NaN NaN NaN NaN
1431 NaT NaN NaN NaN NaN NaN
1432 NaT NaN NaN NaN NaN NaN
1433 NaT NaN NaN NaN NaN NaN
1434 NaT NaN NaN NaN NaN NaN
1435 NaT NaN NaN NaN NaN NaN
1436 NaT NaN NaN NaN NaN NaN
1437 NaT NaN NaN NaN NaN NaN
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2017-03-01 00:00:00 2017-03-01 2017 03 01 0.474829
for i in range(2015,2020):
    print(len(b[i]))
    print('----------------------')
Output:
8784
----------------------
8784
----------------------
8784
----------------------
8784
----------------------
8784
----------------------
And finally, to build the dataframe you asked for:
final_df = pd.DataFrame(index = b[2016]['Date and Time'])
final_df[2015] = np.array(b[2015]['Values'])
final_df[2016] = np.array(b[2016]['Values'])
final_df[2017] = np.array(b[2017]['Values'])
final_df[2018] = np.array(b[2018]['Values'])
final_df[2019] = np.array(b[2019]['Values'])
print(final_df)
Output:
2015 2016 2017 2018 2019
Date and Time
2016-01-01 00:00:00 0.681452 0.893695 0.107281 0.055448 0.524573
2016-01-01 01:00:00 0.860485 0.036340 0.036067 0.540155 0.367901
2016-01-01 02:00:00 0.159925 0.774041 0.272395 0.570306 0.443723
2016-01-01 03:00:00 0.149609 0.271355 0.044498 0.760091 0.082329
2016-01-01 04:00:00 0.836742 0.344297 0.202729 0.319588 0.979587
... ... ... ... ... ...
2016-12-31 19:00:00 0.602483 0.514097 0.539197 0.213343 0.915307
2016-12-31 20:00:00 0.411157 0.760006 0.622840 0.067084 0.945960
2016-12-31 21:00:00 0.212591 0.324712 0.097114 0.333916 0.267258
2016-12-31 22:00:00 0.845888 0.937634 0.295480 0.564552 0.429678
2016-12-31 23:00:00 0.502465 0.582802 0.293368 0.219390 0.125926
[8784 rows x 5 columns]
Slicing final_df around the end of February shows the NaN values for the non-leap years:
2015 2016 2017 2018 2019
Date and Time
2016-02-28 22:00:00 0.005703 0.758915 0.920366 0.894347 0.916487
2016-02-28 23:00:00 0.450760 0.481793 0.804420 0.555427 0.237902
2016-02-29 00:00:00 NaN 0.966666 NaN NaN NaN
2016-02-29 01:00:00 NaN 0.579951 NaN NaN NaN
CodePudding user response:
IIUC, you just need the original DataFrame:
origin = 2016  # or whatever year of your choosing
newidx = df.index.strftime(f'{origin}-%m-%d %H:%M:%S')
newdf = (
df[['value']]
.assign(year=df.index.year)
.set_axis(newidx, axis=0)
.pivot(columns='year', values='value')
)
Using the small sample data you provided for that "original frame" df, we get:
>>> newdf
year 2015 2019
date
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 6.0 NaN
2016-01-01 02:00:00 2.0 NaN
... ... ...
2016-12-31 21:00:00 NaN 3.0
2016-12-31 22:00:00 NaN 3.0
2016-12-31 23:00:00 NaN 6.0
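One caveat worth noting: pivot raises a ValueError if any (remapped timestamp, year) pair occurs twice, e.g. duplicate readings for the same hour. If that can happen in your data, pivot_table with an explicit aggregation is a tolerant alternative; a sketch on a tiny made-up frame (not your data) with a duplicated hour:

```python
import pandas as pd

# Tiny frame with a duplicated hour in 2015:
ix = pd.to_datetime(['2015-03-01 00:00', '2015-03-01 00:00',
                     '2016-03-01 00:00'])
df = pd.DataFrame({'value': [1.0, 3.0, 5.0]}, index=ix)

origin = 2016
newidx = df.index.strftime(f'{origin}-%m-%d %H:%M:%S')

# pivot would raise ValueError on the duplicate (timestamp, year) pair;
# pivot_table aggregates the duplicates instead (mean here).
newdf = (
    df.assign(year=df.index.year, ts=newidx)
      .pivot_table(index='ts', columns='year', values='value', aggfunc='mean')
)
print(newdf)  # 2015 column holds mean(1.0, 3.0) = 2.0, 2016 holds 5.0
```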
On a larger (made-up) DataFrame:
np.random.seed(0)
ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)
# (code above)
>>> newdf
year 2015 2016 2017 2018 2019
2016-01-01 00:00:00 44.0 82.0 96.0 68.0 71.0
2016-01-01 01:00:00 47.0 99.0 54.0 44.0 71.0
2016-01-01 02:00:00 64.0 28.0 11.0 10.0 55.0
... ... ... ... ... ...
2016-12-31 21:00:00 0.0 30.0 28.0 53.0 14.0
2016-12-31 22:00:00 47.0 82.0 19.0 6.0 64.0
2016-12-31 23:00:00 22.0 75.0 13.0 37.0 35.0
and, as expected, only 2016 has values for 02/29:
>>> newdf[:'2016-02-29 02:00:00'].tail()
year 2015 2016 2017 2018 2019
2016-02-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-02-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-02-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-02-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-02-29 02:00:00 NaN 67.0 NaN NaN NaN
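Since the question also mentions a per-month-of-each-year analysis: if the remapped index is converted to a real DatetimeIndex, one month across all years is just a partial-string slice of the pivoted frame. A sketch building on the same made-up df as above:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)

origin = 2016
# to_datetime gives a DatetimeIndex, enabling partial-string slicing later.
newidx = pd.to_datetime(df.index.strftime(f'{origin}-%m-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(year=df.index.year)
    .set_axis(newidx, axis=0)
    .pivot(columns='year', values='value')
)

# One month of every year in a single slice:
feb = newdf.loc['2016-02']
print(feb.shape)  # (696, 5): 29 days x 24 hours, one column per year
```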