Home > Mobile >  Fill missing dates in 2 level of groups in pandas
Fill missing dates in 2 level of groups in pandas

Time:12-16

I want to fill in the missing dates only if a date is missing for the particular group. Here country county adds up to become a group
I have this data frame

df = pd.DataFrame({'date': ['2021-01-01','2021-01-02', '2021-01-01','2021-01-03','2021-01-01','2021-01-02', '2021-01-05','2021-01-07'],'country': ['a','a','a','a','b','b','b','b'], 'county': ['c','c','d','d','e','e','f','f'],  'sales': [1,2,1,45,2,341,14,25]})

Which looks like this

         date country county  sales
0  2021-01-01       a      c      1
1  2021-01-02       a      c      2
2  2021-01-01       a      d      1
3  2021-01-03       a      d     45
4  2021-01-01       b      e      2
5  2021-01-02       b      e    341
6  2021-01-05       b      f     14
7  2021-01-07       b      f     25

and I want this as output

         date country county  sales
0  2021-01-01       a      c      1
1  2021-01-02       a      c      2
2  2021-01-01       a      d      1
3  2021-01-02       a      d      0
4  2021-01-03       a      d      2
5  2021-01-01       b      e    341
6  2021-01-02       b      e     14
7  2021-01-05       b      f     25
8  2021-01-06       b      f     0
9  2021-01-07       b      f     64

In my example

  1. Country a and County d has one missing date 2021-01-02
  2. Country b and County f has one missing date 2021-01-06

so I have added the same dates and in place of sales added zero

I have gone through this Pandas filling missing dates and values within group but could not able to convert the same for my problem.

CodePudding user response:

Use GroupBy.apply with lambd function with div.DataFrame.asfreq:

df['date'] = pd.to_datetime(df['date'])


df = (df.set_index('date')
        .groupby(['country','county'])['sales']
        .apply(lambda x: x.asfreq('d', fill_value=0))
        .reset_index()
        [['date','country','county','sales']])
print (df)
        date country county  sales
0 2021-01-01       a      c      1
1 2021-01-02       a      c      2
2 2021-01-01       a      d      1
3 2021-01-02       a      d      0
4 2021-01-03       a      d     45
5 2021-01-01       b      e      2
6 2021-01-02       b      e    341
7 2021-01-05       b      f     14
8 2021-01-06       b      f      0
9 2021-01-07       b      f     25

CodePudding user response:

One option is with the complete function from pyjanitor, which can be helpful in exposing explicitly missing rows (and can be helpful as well in abstracting the reshaping process):

# pip install pyjanitor
import pandas as pd
import janitor
df['date'] = pd.to_datetime(df['date'])
# create dictionary of new dates per group 
# (date range of the min and max for each group):
new_dates = {'date' : lambda df: pd.date_range(df.min(), df.max(), freq='1D')}
# expose the missing rows per group, with the `by` parameter:
(df.complete(new_dates, by = ['country', 'county'], sort = True)
   .fillna(0, downcast = 'infer')
)
 
                  date country county  sales
0  2021-01-01 00:00:00       a      c      1
1  2021-01-02 00:00:00       a      c      2
2  2021-01-01 00:00:00       a      d      1
3  2021-01-02 00:00:00       a      d      0
4  2021-01-03 00:00:00       a      d     45
5  2021-01-01 00:00:00       b      e      2
6  2021-01-02 00:00:00       b      e    341
7  2021-01-05 00:00:00       b      f     14
8  2021-01-06 00:00:00       b      f      0
9  2021-01-07 00:00:00       b      f     25
  • Related