Home > front end >  Identify missing dates on dataframe datetime column with irregular frequency and fill them with asso
Identify missing dates on dataframe datetime column with irregular frequency and fill them with asso

Time:11-12

I have this df which is a sample from the actual one. Each row represent a measurement taken with intervals of approximately 20 minutes and several other columns.

                   datetime        value
261     2022-10-21 02:00:13     0.164299
262     2022-10-21 02:20:12     0.163009
263     2022-10-21 02:40:14     0.162298
264     2022-10-21 03:00:12     0.162846
265     2022-10-24 08:46:38     1.788360
266     2022-10-24 09:06:47     1.442915
267     2022-10-24 09:27:28     1.246774
268     2022-10-24 09:46:58     0.194352

As you can see, there's a gap between 2022-10-21 03:00:12 and 2022-10-24 08:46:38. I need a way of first identify where the gaps are (there might be n of them), then fill those rows with the proper dates and nan's for the corresponding values.

Right now I'm generating the missing dates as follows:

# number of periods of 20 minutes between dates  
periods = (df.iloc[265,0] - df.iloc[264,0]).total_seconds() / 60.0 / 20

# dates to fill gap
pd.date_range(df.iloc[264,0], df.iloc[265,0], periods=periods)

Any idea of how to do this?

Edit: Expected output (I don't need the exact same minutes I added on the new rows, they are just for reference of the overall idea)

                   datetime        value
261     2022-10-21 02:00:13     0.164299
262     2022-10-21 02:20:12     0.163009
263     2022-10-21 02:40:14     0.162298
264     2022-10-21 03:00:12     0.162846
265     2022-10-21 03:20:00     NaN
266     2022-10-21 03:40:00     NaN
            ...
n-4     2022-10-24 08:26:00     NaN
n-3     2022-10-24 08:46:38     1.788360
n-2     2022-10-24 09:06:47     1.442915
n-1     2022-10-24 09:27:28     1.246774
n       2022-10-24 09:46:58     0.194352

CodePudding user response:

Try (maybe not the cleanest solution - it will try to find gaps > 40 minutes and then add 20 minute intervals with NaN values):

mask = df["datetime"].diff() > "40 minutes"

dfs = []
for idx, row in df[mask].iterrows():
    prev_time = df.loc[idx - 1, "datetime"]
    dr = pd.date_range(
        prev_time   pd.Timedelta("20 minutes"),
        row["datetime"] - pd.Timedelta("20 minutes"),
        freq="20min",
    )
    dfs.append(
        pd.DataFrame(
            [{"datetime": dr}],
            index=[idx - 1],
        )
    )

df = pd.concat([df, *dfs]).sort_index().explode("datetime")
print(df)

Prints:

               datetime     value
261 2022-10-21 02:00:13  0.164299
262 2022-10-21 02:20:12  0.163009
263 2022-10-21 02:40:14  0.162298
264 2022-10-21 03:00:12  0.162846
264 2022-10-21 03:20:12       NaN
264 2022-10-21 03:40:12       NaN

...

264 2022-10-24 08:00:12       NaN
264 2022-10-24 08:20:12       NaN
265 2022-10-24 08:46:38  1.788360
266 2022-10-24 09:06:47  1.442915
267 2022-10-24 09:27:28  1.246774
268 2022-10-24 09:46:58  0.194352
  • Related