Combining dataframes with differing dates column

Time:11-09

I have a dataset of hourly prices, from which I produce a dataframe containing each day's minimum price using:

df_min = df_hour_0[['Price_REG1', 'Price_REG2', 'Price_REG3',
                    'Price_REG4']].between_time('00:00', '23:00').resample('d').min()

This gives me:

                       Price_REG1  Price_REG2  Price_REG3  Price_REG4
date                                                      
2020-01-01 00:00:00     25.07       25.07       25.07       25.07
2020-01-02 00:00:00     12.07       12.07       12.07       12.07
2020-01-03 00:00:00     0.14        0.14        0.14        0.14
2020-01-04 00:00:00     3.83        3.83        3.83        3.83
2020-01-05 00:00:00     25.77       25.77       25.77       25.77

Now, I want to combine this df with the 24 hourly df's, one for each hour of the day (df_hour_0 shown below):

                     Price_REG1  Price_REG2  ...  Price_24_3  Price_24_4
date                                         ...                        
2020-01-01 00:00:00       30.83       30.83  ...         NaN         NaN
2020-01-02 00:00:00       24.81       24.81  ...       25.88       25.88
2020-01-03 00:00:00       24.39       24.39  ...       27.69       27.69
2020-01-04 00:00:00       22.04       22.04  ...       25.70       25.70
2020-01-05 00:00:00       25.77       25.77  ...       27.37       27.37

Which I do this way:

df_hour_0 = pd.concat([df_hour_0, df_min, df_max], axis=1)

This works fine for the df from the first hour, since the dates match. But for the other df's the timestamps are "2020-01-01 01:00:00", "2020-01-01 02:00:00", etc.

Since the timestamps don't match, pd.concat gives me twice as many rows, where every other row is null in half the columns:

                     Price_REG1  Price_REG2  ...  Price_3_min  Price_4_min
date                                         ...                          
2020-01-01 00:00:00         NaN         NaN  ...          NaN          NaN
2020-01-01 01:00:00       28.78       28.78  ...          NaN          NaN
2020-01-02 00:00:00         NaN         NaN  ...        30.83        30.83
2020-01-02 01:00:00       12.07       12.07  ...          NaN          NaN
2020-01-03 00:00:00         NaN         NaN  ...        31.20        31.20
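The misalignment above can be reproduced with a minimal sketch (hypothetical two-row frames and a single price column, standing in for the real data): pd.concat with axis=1 aligns rows on the index, and since no timestamp occurs in both frames, every row ends up null on one side.

```python
import pandas as pd

# Hourly prices stamped at 01:00 vs. daily minima stamped at 00:00
idx_hour = pd.to_datetime(["2020-01-01 01:00", "2020-01-02 01:00"])
idx_min = pd.to_datetime(["2020-01-01 00:00", "2020-01-02 00:00"])

df_hour = pd.DataFrame({"Price_REG1": [28.78, 12.07]}, index=idx_hour)
df_min = pd.DataFrame({"Price_REG1_min": [25.07, 12.07]}, index=idx_min)

# axis=1 concat is an outer join on the index: the result has the
# union of both indexes, so each row is NaN in one frame's columns
combined = pd.concat([df_hour, df_min], axis=1)
print(combined)
```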

I tried to fix this by:

df_max = df_max.reset_index()
df_max = df_max.drop(['date'], axis=1)

But this gives me a similar issue: instead of every other row being null, the whole frame is simply appended at the bottom of the first df.

I want to keep the date; otherwise I guess I could reset the index in both df's and combine them by position instead of by date.

Thank you.

CodePudding user response:

One option is to normalize each index to the date (i.e. floor every timestamp to midnight) before concatenating:

dfs = [df_hour_0, df_min, df_max]
# normalize() floors each timestamp to midnight, so all indexes align
pd.concat([d.set_axis(d.index.normalize()) for d in dfs], axis=1)
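On the same hypothetical two-row frames as above, this is how the fix plays out: after DatetimeIndex.normalize() both frames are indexed by plain dates, so the rows line up and no NaNs are introduced.

```python
import pandas as pd

# Same hypothetical frames: prices at 01:00, daily minima at 00:00
idx_hour = pd.to_datetime(["2020-01-01 01:00", "2020-01-02 01:00"])
idx_min = pd.to_datetime(["2020-01-01 00:00", "2020-01-02 00:00"])

df_hour_1 = pd.DataFrame({"Price_REG1": [28.78, 12.07]}, index=idx_hour)
df_min = pd.DataFrame({"Price_REG1_min": [25.07, 12.07]}, index=idx_min)

# set_axis replaces the index with its normalized (midnight) version,
# so both frames now share the same DatetimeIndex and concat aligns them
dfs = [df_hour_1, df_min]
out = pd.concat([d.set_axis(d.index.normalize()) for d in dfs], axis=1)
print(out)
```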