Home > other >  Find the cumulative number of missing days for a datetime column in pandas
Find the cumulative number of missing days for a datetime column in pandas

Time:09-30

I have a sample dataframe as given below.

import pandas as pd

data = {'ID':['A', 'A', 'A','A','A','A' ,'B','B','B','B','B'],
'Date':['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-22 04:38:26', '2021-09-23 
        00:12:29','2021-09-22 11:20:58','2021-09-25 09:20:58','2021-03-11 21:20:00','2021-03- 
        11 21:25:00','2021-03-12 21:25:00', '2021-03-13 21:25:00', '2021-03-15 21:25:00']}
df1 = pd.DataFrame(data)
df1 

The snippet of it is given below. The 'Date' column is in Datetime format. enter image description here

Now, I want to find the total number of missing dates in between for each participant and print them(or create a new dataframe).

ID      Missing days

A        3       (21st,22nd and 24th September dates missing)

B        1       (14th march missing)

Any help is greatly appreciated. Thanks.

CodePudding user response:

Answer below will fail with multiple consecutive missing days (Thanks Ben T). We can solve this by using resample per group, than count the NaT:

dfg = df1.groupby("ID").apply(lambda x: x.resample(rule="D", on="Date").first())
dfg["Date"].isna().sum(level=0).reset_index(name="Missing days")
  ID  Missing days
0  A             2
1  B             1

** OLD ANSWER **

We can use GroupBy.diff and check how many diffs are greater than 1 day:

df1["Date"] = pd.to_datetime(df1["Date"])
(
    df1.groupby("ID")["Date"]
    .apply(lambda x: x.diff().gt(pd.Timedelta(1, "D")).sum())
    .reset_index(name="Missing days")
)
  ID  Missing days
0  A             2
1  B             1
  • Related