Home > Software design >  Add missing dates to pandas dataframe with zeros as values
Add missing dates to pandas dataframe with zeros as values

Time:08-22

Very standard task: data can have multiple events on a given date or no events. I to take these events, enrich the in the data missing dates in given range and zero values, so later I can a count by date and plot the data.

I'm new to pandas and have followed reindex to add missing dates to pandas dataframe and pandas fill missing dates in time series. Although it was accepted by a number of stackoverflow readers, it does not work for me, as it Nullifies already existing 'quantity' column values:

CODE: 
    # (items_per_day is coming from DB)
    date_now = datetime.now().date()
    date_period = date_now-timedelta(weeks=1)

    data = list(items_per_day)
    print('data: ', data)

    idx = pd.period_range(date_period, date_now)

    df = pd.DataFrame(data)
    print('df: ', df)

    df.index = pd.DatetimeIndex(df.index)
    print('df index: ', df)

    df = df.reindex(idx, fill_value=0)
    print('df: ', df)

OUTPUT:

data:  [{'date': datetime.date(2022, 8, 17), 'quantity': 1}, {'date': datetime.date(2022, 8, 18), 'quantity': 2}, {'date': datetime.date(2022, 8, 19), 'quantity': 3}]

df:           date  quantity
0  2022-08-17         1
1  2022-08-18         2
2  2022-08-19         3

df index:                                       date  quantity
1970-01-01 00:00:00.000000000  2022-08-17         1
1970-01-01 00:00:00.000000001  2022-08-18         2
1970-01-01 00:00:00.000000002  2022-08-19         3

df:             date  quantity
2022-08-13    0         0
2022-08-14    0         0
2022-08-15    0         0
2022-08-16    0         0
2022-08-17    0         0
2022-08-18    0         0
2022-08-19    0         0
2022-08-20    0         0

It looks like that reindex losses the data with or without 'fill_value' param! Although Add missing dates pandas dataframe solution based on different principal works:

CODE:
    date_now = datetime.now().date()
    date_period = date_now-timedelta(weeks=1)

    data = list(items_per_day)
    print('data: ', data)

    idx = pd.period_range(date_period, date_now)

    df = pd.DataFrame(data)
    print('df: ', df)

    df['date'] = pd.to_datetime(df['date'])

    dates = pd.date_range(start=date_period, freq='D', periods=7)
    df = df.set_index('date').reindex(dates.normalize().rename('date')).reset_index()
    df['quantity'] = df['quantity'].fillna(0)
    print('df: ', df)

OUTPUT:
data:  [{'date': datetime.date(2022, 8, 17), 'quantity': 1}, {'date': datetime.date(2022, 8, 18), 'quantity': 2}, {'date': datetime.date(2022, 8, 19), 'quantity': 3}]

df:           date  quantity
0  2022-08-17         1
1  2022-08-18         2
2  2022-08-19         3

df:     date  quantity
0 2022-08-13       0.0
1 2022-08-14       0.0
2 2022-08-15       0.0
3 2022-08-16       0.0
4 2022-08-17       1.0
5 2022-08-18       2.0
6 2022-08-19       3.0

I would like to understand why the first solution does not work in my case (it obviously worked for a number of people who voted to it!). What is the root cause of the failure in my case?

CodePudding user response:

In your first approach, you are reindexing a DatetimeIndex with a PeriodIndex(created by period_range), use date_range instead of period_range works:

idx = pd.date_range(date_period, date_now)
df.index = pd.DatetimeIndex(df.date)

df.reindex(idx, fill_value=0)
#                  date  quantity
#2022-08-13           0         0
#2022-08-14           0         0
#2022-08-15           0         0
#2022-08-16           0         0
#2022-08-17  2022-08-17         1
#2022-08-18  2022-08-18         2
#2022-08-19  2022-08-19         3
#2022-08-20           0         0
  • Related