Home > Back-end >  Fill in missing dates for a pandas dataframe with multiple series
Fill in missing dates for a pandas dataframe with multiple series

Time:04-07

I have a dataframe that contains multiple time series, like so:

Date Item Category
2021-01-01 gouda cheese
2021-01-02 gouda cheese
2021-01-04 gouda cheese
2021-01-05 gouda cheese
2021-02-01 lettuce produce
2021-02-02 lettuce produce
2021-02-03 lettuce produce
2021-02-05 lettuce produce

I'd like to add rows for the missing dates (ex. 2021-01-03 for gouda, 2021-02-04 for lettuce). Note that these series do not necessarily start and end on the same date.

What is the best way to do this in pandas? I'd also like fill the new rows with the values in the "item" and "category" column for that series.

CodePudding user response:

Group by Item and Category, then generate a time series from the min to the max date:

result = (
    df.groupby(["Item", "Category"])["Date"]
    .apply(lambda s: pd.date_range(s.min(), s.max()))
    .explode()
    .reset_index()
)

CodePudding user response:

You can do resample

df['Date'] = pd.to_datetime(df['Date'])
df['Y-m'] = df['Date'].dt.strftime('%y-%m')
out = df.groupby('Y-m').apply(lambda x : x.set_index('Date').resample('D').ffill()).reset_index(level=1)

CodePudding user response:

This is far from optimal, but it is how I would do in order to ensure all categories and items are within the min and max periods, and all ranges are filled:

aux = []
for x in df['Item'].unique():
  _ = pd.DataFrame({'Date':pd.date_range(df[df['Item']==x]['Date'].min(),df[df['Item']==x]['Date'].max(),freq='d')})
  _['Item'] = x
  _['Category'] = df[df['Item']==x]['Category'].values[0]
  aux.append(_)
output = pd.concat(aux)  

Consider this sample dataset:

df = pd.DataFrame({'Date':['2021-01-01','2021-01-02','2021-01-04','2021-01-05','2021-01-01','2021-01-02','2021-01-04','2021-01-05'],
                   'Item':['gouda','gouda','gouda','gouda','lettuce','lettuce','lettuce','lettuce'],
                   'Category':['cheese','cheese','cheese','cheese','produce','produce','produce','produce']})
df['Date'] = pd.to_datetime(df['Date'],infer_datetime_format=True)

Outputs:

        Date     Item Category
0 2021-01-01    gouda   cheese
1 2021-01-02    gouda   cheese
2 2021-01-03    gouda   cheese
3 2021-01-04    gouda   cheese
4 2021-01-05    gouda   cheese
0 2021-01-01  lettuce  produce
1 2021-01-02  lettuce  produce
2 2021-01-03  lettuce  produce
3 2021-01-04  lettuce  produce
4 2021-01-05  lettuce  produce
  • Related