Let's say I have the following DataFrame in Pandas
date | customer | attended |
---|---|---|
2022-01-01 | John | True |
2022-01-02 | John | True |
2022-01-04 | John | True |
2022-01-05 | Mark | True |
what transformations could I do to fill in the missing gaps within the dates on a given frequency (daily in this case, but it can by any other) so it ends up like this:
date | customer | attended |
---|---|---|
2022-01-01 | John | True |
2022-01-01 | Mark | False |
2022-01-02 | John | True |
2022-01-02 | Mark | False |
2022-01-03 | John | False |
2022-01-03 | Mark | False |
2022-01-04 | John | True |
2022-01-04 | Mark | False |
2022-01-05 | John | False |
2022-01-05 | Mark | True |
This can be done on an individual level by filtering on just one customer and doing an outer join with another DataFrame that has all the dates, and it will fill the empty ones with NaNs, but I can't do that with all the different people at customer which is what I need to do. I'd appreciate an approach that's as computationally efficient as possible and doesn't involve a lot of convoluted iterations over the dataset.
CodePudding user response:
Let's try it with pivot
date_range
reindex
stack
:
tmp = df.pivot('date','customer','attended')
tmp.index = pd.to_datetime(tmp.index)
out = tmp.reindex(pd.date_range(tmp.index[0], tmp.index[-1])).fillna(False).stack().reset_index().rename(columns={0:'attended'})
Output:
level_0 customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True
CodePudding user response:
Here's one way (df
your dataframe):
df.date = pd.to_datetime(df.date) # Just in case
customers = df.customer.unique()
df_fill = pd.DataFrame(
(
[day, customer]
for day in pd.date_range(df.date.min(), df.date.max(), freq="D")
for customer in customers
),
columns=["date", "customer"]
)
df = df_fill.merge(df, on=["date", "customer"], how="left")
df.attended = df.attended.fillna(False)
Output:
date customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True
CodePudding user response:
One option is with the complete function from pyjanitor, to expose the explicitly missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
# generate new dates
new_dates = {"date": pd.date_range(df.date.min(), df.date.max(), freq="D")}
df.complete(new_dates, 'customer', sort = True).fillna({'attended':False})
date customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True