I want to fill in missing dates, but only where a date is missing within a particular group.
Here, country and county together define a group.
I have this data frame:
import pandas as pd

df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-03',
                            '2021-01-01', '2021-01-02', '2021-01-05', '2021-01-07'],
                   'country': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'county': ['c', 'c', 'd', 'd', 'e', 'e', 'f', 'f'],
                   'sales': [1, 2, 1, 45, 2, 341, 14, 25]})
Which looks like this
         date country county  sales
0  2021-01-01       a      c      1
1  2021-01-02       a      c      2
2  2021-01-01       a      d      1
3  2021-01-03       a      d     45
4  2021-01-01       b      e      2
5  2021-01-02       b      e    341
6  2021-01-05       b      f     14
7  2021-01-07       b      f     25
and I want this as output
         date country county  sales
0  2021-01-01       a      c      1
1  2021-01-02       a      c      2
2  2021-01-01       a      d      1
3  2021-01-02       a      d      0
4  2021-01-03       a      d     45
5  2021-01-01       b      e      2
6  2021-01-02       b      e    341
7  2021-01-05       b      f     14
8  2021-01-06       b      f      0
9  2021-01-07       b      f     25
In my example:
- Country a, county d has one missing date: 2021-01-02
- Country b, county f has one missing date: 2021-01-06
so I have added those dates with sales set to zero.
I have gone through "Pandas filling missing dates and values within group" but was not able to adapt it to my problem.
CodePudding user response:
Use GroupBy.apply with a lambda function and DataFrame.asfreq:
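# asfreq needs a DatetimeIndex, so convert the strings first
df['date'] = pd.to_datetime(df['date'])
# within each (country, county) group, add the missing days with sales = 0
df = (df.set_index('date')
        .groupby(['country', 'county'])['sales']
        .apply(lambda x: x.asfreq('D', fill_value=0))
        .reset_index()
        [['date', 'country', 'county', 'sales']])
print(df)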
        date country county  sales
0 2021-01-01       a      c      1
1 2021-01-02       a      c      2
2 2021-01-01       a      d      1
3 2021-01-02       a      d      0
4 2021-01-03       a      d     45
5 2021-01-01       b      e      2
6 2021-01-02       b      e    341
7 2021-01-05       b      f     14
8 2021-01-06       b      f      0
9 2021-01-07       b      f     25
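To see why this works, it may help to look at what asfreq does to a single group. The following is only a minimal sketch using the (a, d) rows from the question; the names s and filled are made up for illustration:

```python
import pandas as pd

# One group from the question: country 'a', county 'd'.
s = pd.Series(
    [1, 45],
    index=pd.to_datetime(['2021-01-01', '2021-01-03']),
    name='sales',
)

# asfreq('D') builds a daily index running from the first to the last
# date of this Series; rows that did not exist before get fill_value.
filled = s.asfreq('D', fill_value=0)
print(filled)
```

Because the daily range is built per group, dates outside a group's own first and last date are never added, which is why county f starts at 2021-01-05 and not at 2021-01-01.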
CodePudding user response:
One option is the complete function from pyjanitor, which makes the missing rows explicit and abstracts away the reshaping:
# pip install pyjanitor
import pandas as pd
import janitor
df['date'] = pd.to_datetime(df['date'])
# create dictionary of new dates per group
# (date range of the min and max for each group):
new_dates = {'date' : lambda df: pd.date_range(df.min(), df.max(), freq='1D')}
# expose the missing rows per group, with the `by` parameter:
(df.complete(new_dates, by=['country', 'county'], sort=True)
   .fillna(0, downcast='infer')
)
                  date country county  sales
0  2021-01-01 00:00:00       a      c      1
1  2021-01-02 00:00:00       a      c      2
2  2021-01-01 00:00:00       a      d      1
3  2021-01-02 00:00:00       a      d      0
4  2021-01-03 00:00:00       a      d     45
5  2021-01-01 00:00:00       b      e      2
6  2021-01-02 00:00:00       b      e    341
7  2021-01-05 00:00:00       b      f     14
8  2021-01-06 00:00:00       b      f      0
9  2021-01-07 00:00:00       b      f     25
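If installing pyjanitor is not an option, the same idea (a date range from the min to the max date of each group) can probably be written in plain pandas. This is a rough sketch, not what complete does internally; the names pieces, full_range and result are just placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-03',
                            '2021-01-01', '2021-01-02', '2021-01-05', '2021-01-07']),
    'country': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'county': ['c', 'c', 'd', 'd', 'e', 'e', 'f', 'f'],
    'sales': [1, 2, 1, 45, 2, 341, 14, 25],
})

pieces = []
for (country, county), group in df.groupby(['country', 'county']):
    # daily range between this group's first and last date
    full_range = pd.date_range(group['date'].min(), group['date'].max(), freq='D')
    # reindex onto the full range; dates that were missing get sales = 0
    sales = group.set_index('date')['sales'].reindex(full_range, fill_value=0)
    pieces.append(pd.DataFrame({
        'date': full_range,
        'country': country,
        'county': county,
        'sales': sales.to_numpy(),
    }))

result = pd.concat(pieces, ignore_index=True)
print(result)
```

Looping over the groups is more verbose than a groupby chain, but it keeps the per-group date range explicit.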