Home > Blockchain >  Expanding and filling the dataframe for missing dates by each group
Expanding and filling the dataframe for missing dates by each group

Time:11-12

   df = pd.DataFrame({"ID":[1,1,1,2,2,2],
                   "Date":["29.12.2020","05.01.2021","15.02.2021","11.04.2021","27.05.2021","29.05.2021"],
                   "Amount":[6,5,7,9,8,7]}) 
   df["Date"] = pd.to_datetime(df["Date"])

Data:

ID Date Amount
1 29.12.2020 6
1 05.01.2021 5
1 15.02.2021 7
2 11.04.2021 9
2 27.05.2021 8
2 29.05.2021 7

Desired output:

ID Date Amount
1 01.12.2020 0
. . .
. . .
1 28.12.2020 0
1 29.12.2020 6
. . .
. . .
1 04.01.2020 6
1 05.01.2021 5
. . .
. . .
1 14.02.2021 5
1 15.02.2021 7
. . .
. . .
1 28.02.2021 7
2 01.04.2021 0
. . .
. . .
2 10.04.2021 0
2 11.04.2021 9
. . .
. . .
2 26.05.2021 9
2 27.05.2021 8
2 28.05.2021 8
2 29.05.2021 7

So, basically, I want to expand the data set for the missing dates and insert the latest of preceding amount value for every newly added date.

The date ranges of the ID groups do vary.

Plus, fill value should be 0 for the unobserved values in the month associated with the first observation (e.g. Amount column takes the value of 0 between 01.12.2020 and 28.12.2020, since the first observation of twelfth month of 2020 is on 29th day).

CodePudding user response:

I would set the df index to Date, then group by ID and finally reindex depending on the oldest (replacing it with the first day of the month) and most recent dates:

import pandas as pd

df = pd.DataFrame({"ID":[1,1,1,2,2,2],
                   "Date":["29.12.2020","05.01.2021","15.02.2021","11.04.2021","27.05.2021","29.05.2021"],
                   "Amount":[6,5,7,9,8,7]}) 
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
df = df.set_index("Date")

new_df = pd.DataFrame()
for id_val, obs_period in df.groupby("ID"):
    date_range = pd.date_range(min(obs_period.index).replace(day=1), max(obs_period.index))
    obs_period = obs_period.reindex(date_range, fill_value=pd.NA)
    obs_period["ID"] = id_val
    if pd.isna(obs_period.at[obs_period.index[0], "Amount"]):
        obs_period.at[obs_period.index[0], "Amount"] = 0  # adding 0 at the beginning of the period if undefined
    obs_period= obs_period.ffill()      # filling Amount with last value
    new_df = pd.concat([new_df, obs_period])

print(new_df)

BTW you should specify your date format while converting df["Date"]

Output:

            ID  Amount
2020-12-01   1     0.0
2020-12-02   1     0.0
2020-12-03   1     0.0
2020-12-04   1     0.0
2020-12-05   1     0.0
...         ..     ...
2021-05-25   2     9.0
2021-05-26   2     9.0
2021-05-27   2     8.0
2021-05-28   2     8.0
2021-05-29   2     7.0

[136 rows x 2 columns]
  • Related