Filling missing dates in a pandas dataframe across different groups

Let's say I have the following dataframe in pandas:

date	customer	attended
2022-01-01	John	True
2022-01-02	John	True
2022-01-04	John	True
2022-01-05	Mark	True

What transformations could I apply to fill in the missing dates at a given frequency (daily in this case, but it could be any other), so that every customer has a row for every date and it ends up like this:

date	customer	attended
2022-01-01	John	True
2022-01-01	Mark	False
2022-01-02	John	True
2022-01-02	Mark	False
2022-01-03	John	False
2022-01-03	Mark	False
2022-01-04	John	True
2022-01-04	Mark	False
2022-01-05	John	False
2022-01-05	Mark	True

This can be done for an individual customer by filtering on just that customer and doing an outer join with another dataframe that has all the dates, which fills the missing rows with NaNs. But I can't see how to do that for all the different people in the customer column at once, which is what I need. I'd appreciate an approach that's as computationally efficient as possible and doesn't involve a lot of convoluted iteration over the dataset.

CodePudding user response：

Here's one way (where df is your dataframe):

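import pandas as pd

df["date"] = pd.to_datetime(df["date"])  # just in case the dates are still strings

# Build every (day, customer) combination over the full date range
customers = df["customer"].unique()
df_fill = pd.DataFrame(
    (
        [day, customer]
        for day in pd.date_range(df["date"].min(), df["date"].max(), freq="D")
        for customer in customers
    ),
    columns=["date", "customer"],
)
# A left merge keeps every combination; pairs missing from df get NaN in attended
df = df_fill.merge(df, on=["date", "customer"], how="left")
df["attended"] = df["attended"].fillna(False).astype(bool)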

Output:

        date customer  attended
0 2022-01-01     John      True
1 2022-01-01     Mark     False
2 2022-01-02     John      True
3 2022-01-02     Mark     False
4 2022-01-03     John     False
5 2022-01-03     Mark     False
6 2022-01-04     John      True
7 2022-01-04     Mark     False
8 2022-01-05     John     False
9 2022-01-05     Mark      True
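
A variation on the same idea, as a rough sketch rather than a definitive version: build the full (date, customer) grid with pd.MultiIndex.from_product and reindex onto it with fill_value=False, which skips both the merge and the NaN step. The helper name fill_missing_dates and its freq parameter are made up for illustration. It assumes each (date, customer) pair appears at most once (reindex raises on a duplicated index) and that the existing dates fall on the chosen frequency, since rows that don't line up with the generated range are dropped.

```python
import pandas as pd


def fill_missing_dates(df, freq="D"):
    """Hypothetical helper: one row per (date, customer) pair,
    with attended=False wherever the pair was missing."""
    df = df.assign(date=pd.to_datetime(df["date"]))
    # Cartesian product of the full date range and the distinct customers
    full_index = pd.MultiIndex.from_product(
        [
            pd.date_range(df["date"].min(), df["date"].max(), freq=freq),
            df["customer"].unique(),
        ],
        names=["date", "customer"],
    )
    return (
        df.set_index(["date", "customer"])
        .reindex(full_index, fill_value=False)  # missing pairs become False
        .reset_index()
    )


# Sample data from the question
df = pd.DataFrame(
    {
        "date": ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"],
        "customer": ["John", "John", "John", "Mark"],
        "attended": [True, True, True, True],
    }
)
out = fill_missing_dates(df)
print(out)
```

For another frequency you would pass a different freq alias, keeping the alignment caveat above in mind.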

CodePudding user response：

Let's try it with pivot, date_range, reindex, and stack:

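import pandas as pd
# pivot needs keyword arguments in pandas 2.0+; the positional form raises a TypeError
tmp = df.pivot(index='date', columns='customer', values='attended')
tmp.index = pd.to_datetime(tmp.index)
# reindex to the full daily range, fill the gaps with False, then stack back to long format
out = tmp.reindex(pd.date_range(tmp.index[0], tmp.index[-1])).fillna(False).stack().reset_index().rename(columns={0:'attended'})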

Output:

     level_0 customer  attended
0 2022-01-01     John      True
1 2022-01-01     Mark     False
2 2022-01-02     John      True
3 2022-01-02     Mark     False
4 2022-01-03     John     False
5 2022-01-03     Mark     False
6 2022-01-04     John      True
7 2022-01-04     Mark     False
8 2022-01-05     John     False
9 2022-01-05     Mark      True
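
If you'd rather have the date column come out named date instead of level_0, and attended as a proper boolean column, one possible tweak (a sketch of the same pivot/reindex/stack approach, using the sample data from the question) is to name the index with rename_axis before stacking and to cast after filling:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "date": ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"],
        "customer": ["John", "John", "John", "Mark"],
        "attended": [True, True, True, True],
    }
)

# Wide format: one row per date, one column per customer
tmp = df.pivot(index="date", columns="customer", values="attended")
tmp.index = pd.to_datetime(tmp.index)

out = (
    tmp.reindex(pd.date_range(tmp.index.min(), tmp.index.max(), freq="D"))
    .rename_axis("date")   # name the index so reset_index yields a "date" column
    .fillna(False)
    .astype(bool)          # the pivoted columns are object dtype after the NaNs
    .stack()               # back to long format, indexed by (date, customer)
    .rename("attended")
    .reset_index()
)
print(out)
```

Like pivot itself, this assumes there are no duplicate (date, customer) pairs in the input.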