Filling missing dates on a DataFrame across different groups

Time:02-20

Let's say I have the following DataFrame in pandas:

date customer attended
2022-01-01 John True
2022-01-02 John True
2022-01-04 John True
2022-01-05 Mark True

What transformations could I apply to fill in the missing dates at a given frequency (daily in this case, but it could be any other) so that it ends up like this:

date customer attended
2022-01-01 John True
2022-01-01 Mark False
2022-01-02 John True
2022-01-02 Mark False
2022-01-03 John False
2022-01-03 Mark False
2022-01-04 John True
2022-01-04 Mark False
2022-01-05 John False
2022-01-05 Mark True

This can be done for an individual customer by filtering on just that customer and doing an outer join with another DataFrame that holds all the dates, which fills the missing rows with NaNs. However, I can't do that in one pass for all the different people in the customer column, which is what I need to do. I'd appreciate an approach that's as computationally efficient as possible and doesn't involve a lot of convoluted iterations over the dataset.

CodePudding user response:

Let's try it with pivot, date_range, reindex, and stack:

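# pivot takes keyword arguments only in pandas 2.0+
tmp = df.pivot(index='date', columns='customer', values='attended')
tmp.index = pd.to_datetime(tmp.index)
# every day between the first and last observed date
full_range = pd.date_range(tmp.index.min(), tmp.index.max(), freq='D')
out = tmp.reindex(full_range).fillna(False).rename_axis('date').stack().reset_index(name='attended')

Output:

        date customer  attended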
0 2022-01-01     John      True
1 2022-01-01     Mark     False
2 2022-01-02     John      True
3 2022-01-02     Mark     False
4 2022-01-03     John     False
5 2022-01-03     Mark     False
6 2022-01-04     John      True
7 2022-01-04     Mark     False
8 2022-01-05     John     False
9 2022-01-05     Mark      True
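
A related variant, not from the answer above: a minimal sketch that reindexes against the full date/customer product directly, so no NaNs are created along the way. The helper name fill_missing_dates and its freq parameter are hypothetical, and the sketch assumes each (date, customer) pair appears at most once in the data.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"]),
    "customer": ["John", "John", "John", "Mark"],
    "attended": [True, True, True, True],
})

def fill_missing_dates(frame, freq="D"):
    """Return one row per (date, customer) pair; absent pairs get attended=False."""
    # All dates between the first and last observation at the requested frequency.
    all_dates = pd.date_range(frame["date"].min(), frame["date"].max(), freq=freq)
    # Every combination of date and customer.
    full_index = pd.MultiIndex.from_product(
        [all_dates, frame["customer"].unique()], names=["date", "customer"]
    )
    # reindex requires unique (date, customer) pairs in the input (assumption).
    return (
        frame.set_index(["date", "customer"])
        .reindex(full_index, fill_value=False)
        .reset_index()
    )

out = fill_missing_dates(df)
```

Passing a different freq (any pandas offset alias) changes the grid of dates, but rows whose dates don't fall on that grid would be dropped by the reindex, so the dates may need to be aligned to the frequency first.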

CodePudding user response:

Here's one way (where df is your DataFrame):

df.date = pd.to_datetime(df.date)  # Just in case

customers = df.customer.unique()
df_fill = pd.DataFrame(
    (
        [day, customer]
        for day in pd.date_range(df.date.min(), df.date.max(), freq="D")
        for customer in customers
    ),
    columns=["date", "customer"]
)
df = df_fill.merge(df, on=["date", "customer"], how="left")
df.attended = df.attended.fillna(False)

Output:

        date customer  attended
0 2022-01-01     John      True
1 2022-01-01     Mark     False
2 2022-01-02     John      True
3 2022-01-02     Mark     False
4 2022-01-03     John     False
5 2022-01-03     Mark     False
6 2022-01-04     John      True
7 2022-01-04     Mark     False
8 2022-01-05     John     False
9 2022-01-05     Mark      True
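
Since the question asks for efficiency, the date/customer grid can probably also be built without the Python-level generator. This is only a sketch of the same idea using a cross merge, which assumes pandas 1.2 or later; the names dates, customers, and grid are just illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"]),
    "customer": ["John", "John", "John", "Mark"],
    "attended": [True, True, True, True],
})

dates = pd.DataFrame(
    {"date": pd.date_range(df["date"].min(), df["date"].max(), freq="D")}
)
customers = pd.DataFrame({"customer": df["customer"].unique()})

# Cartesian product of dates and customers (how="cross" needs pandas >= 1.2).
grid = dates.merge(customers, how="cross")

# A left merge keeps every grid row; combinations absent from df get NaN.
out = grid.merge(df, on=["date", "customer"], how="left")
out["attended"] = out["attended"].fillna(False).astype(bool)
```

The left merge keeps the grid's row order, so the result comes out sorted by date and then by customer in order of first appearance.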

CodePudding user response:

One option is the complete function from pyjanitor, which makes the implicitly missing rows explicit:

# pip install pyjanitor
import pandas as pd
import janitor

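# make sure the dates are datetimes, in case they are stored as strings
df["date"] = pd.to_datetime(df["date"])

# generate new dates
new_dates = {"date": pd.date_range(df.date.min(), df.date.max(), freq="D")}

df.complete(new_dates, 'customer', sort=True).fillna({'attended': False})

Output:

        date customer  attended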
0 2022-01-01     John      True
1 2022-01-01     Mark     False
2 2022-01-02     John      True
3 2022-01-02     Mark     False
4 2022-01-03     John     False
5 2022-01-03     Mark     False
6 2022-01-04     John      True
7 2022-01-04     Mark     False
8 2022-01-05     John     False
9 2022-01-05     Mark      True