shift pandas dataframe rows to the nearest timestep-CodePudding

Background

I have a large dataframe consisting of 100's of thousands of values. The head of dataframe looks as below

df = pd.DataFrame([np.nan, 1100, 1400, np.nan, 14000],
                   index=pd.to_datetime(["2011-05-25 10:00:00",
                                         "2011-05-25 16:40:00",
                                         "2011-05-25 17:06:00",
                                         "2011-05-25 17:10:00",
                                         "2011-05-25 17:24:00"])

                           0
2011-05-25 10:00:00      NaN
2011-05-25 16:40:00   1100.0
2011-05-25 17:06:00   1400.0
2011-05-25 17:10:00      NaN
2011-05-25 17:24:00  14000.0

what I want

The values are not always recorded with 6 minute time step. I want to shift the values which are not recorded at 6 minute time step to the nearest 6 minute step. I want the new dataframe to look like below

n_df = pd.DataFrame([np.nan, 1100, 1400, np.nan, 14000],
                   index=pd.to_datetime(["2011-05-25 10:00:00",
                                         "2011-05-25 16:42:00",
                                         "2011-05-25 17:06:00",
                                         "2011-05-25 17:12:00",
                                         "2011-05-25 17:24:00"])
                   )

                           0
2011-05-25 10:00:00      NaN
2011-05-25 16:42:00   1100.0
2011-05-25 17:06:00   1400.0
2011-05-25 17:12:00      NaN
2011-05-25 17:24:00  14000.0

What is important for me that all the values in n_df should be at 6 minute time step and thus the attribute n_df.index.freq must not be None.

How can I achieve this.

So far I am doing it using for loop by iterating over df and finding the nearest 6 minute step and shifting/copying the value to that step, but this is very slow if df is larger than 1000.

what I have tried


    index = pd.date_range(df.index[0], end=df.index[-1], freq="6min")
    pydatetime_index = index.to_pydatetime()
    n_df = pd.DataFrame(columns=df.columns, index=index)

    for _idx, i in enumerate(df.index):
        nearest_neighbor = np.abs(pydatetime_index - i.to_pydatetime())
        idx = np.argmin(nearest_neighbor)
        val = df.loc[i]
        n_df.iloc[idx] = val

CodePudding user response：

You can use merge_asof with nearest and specify tolerance parameter:

index = pd.date_range(df.index[0], end=df.index[-1], freq="6min")
df1 = pd.DataFrame(index=index)

df2 = pd.merge_asof(df1, 
                    df, 
                    left_index=True, 
                    right_index=True, 
                    direction='nearest', 
                    tolerance=pd.Timedelta('3Min'))
print (df2)
                           0
2011-05-25 10:00:00      NaN
2011-05-25 10:06:00      NaN
2011-05-25 10:12:00      NaN
2011-05-25 10:18:00      NaN
2011-05-25 10:24:00      NaN
                     ...
2011-05-25 17:00:00      NaN
2011-05-25 17:06:00   1400.0
2011-05-25 17:12:00      NaN
2011-05-25 17:18:00      NaN
2011-05-25 17:24:00  14000.0

[75 rows x 1 columns]

Or similar with DataFrame.reindex:

df2 = df.reindex(index, method='nearest', tolerance=pd.Timedelta('3Min'))
print (df2)
                           0
2011-05-25 10:00:00      NaN
2011-05-25 10:06:00      NaN
2011-05-25 10:12:00      NaN
2011-05-25 10:18:00      NaN
2011-05-25 10:24:00      NaN
                     ...
2011-05-25 17:00:00      NaN
2011-05-25 17:06:00   1400.0
2011-05-25 17:12:00      NaN
2011-05-25 17:18:00      NaN
2011-05-25 17:24:00  14000.0

[75 rows x 1 columns]

Or:

df2 = df.resample('6Min').first()
print (df2)
                           0
2011-05-25 10:00:00      NaN
2011-05-25 10:06:00      NaN
2011-05-25 10:12:00      NaN
2011-05-25 10:18:00      NaN
2011-05-25 10:24:00      NaN
                     ...
2011-05-25 17:00:00      NaN
2011-05-25 17:06:00   1400.0
2011-05-25 17:12:00      NaN
2011-05-25 17:18:00      NaN
2011-05-25 17:24:00  14000.0

[75 rows x 1 columns]