df = pd.DataFrame({"ID":[1,1,1,2,2,2],
"Date":["29.12.2020","05.01.2021","15.02.2021","11.04.2021","27.05.2021","29.05.2021"],
"Amount":[6,5,7,9,8,7]})
df["Date"] = pd.to_datetime(df["Date"])
Data:
ID | Date | Amount |
---|---|---|
1 | 29.12.2020 | 6 |
1 | 05.01.2021 | 5 |
1 | 15.02.2021 | 7 |
2 | 11.04.2021 | 9 |
2 | 27.05.2021 | 8 |
2 | 29.05.2021 | 7 |
Desired output:
ID | Date | Amount |
---|---|---|
1 | 01.12.2020 | 0 |
. | . | . |
. | . | . |
1 | 28.12.2020 | 0 |
1 | 29.12.2020 | 6 |
. | . | . |
. | . | . |
1 | 04.01.2020 | 6 |
1 | 05.01.2021 | 5 |
. | . | . |
. | . | . |
1 | 14.02.2021 | 5 |
1 | 15.02.2021 | 7 |
. | . | . |
. | . | . |
1 | 28.02.2021 | 7 |
2 | 01.04.2021 | 0 |
. | . | . |
. | . | . |
2 | 10.04.2021 | 0 |
2 | 11.04.2021 | 9 |
. | . | . |
. | . | . |
2 | 26.05.2021 | 9 |
2 | 27.05.2021 | 8 |
2 | 28.05.2021 | 8 |
2 | 29.05.2021 | 7 |
So, basically, I want to expand the data set for the missing dates and insert the latest of preceding amount value for every newly added date.
The date ranges of the ID groups do vary.
Plus, fill value should be 0 for the unobserved values in the month associated with the first observation (e.g. Amount column takes the value of 0 between 01.12.2020 and 28.12.2020, since the first observation of twelfth month of 2020 is on 29th day).
CodePudding user response:
I would set the df index to Date
, then group by ID
and finally reindex depending on the oldest (replacing it with the first day of the month) and most recent dates:
import pandas as pd
df = pd.DataFrame({"ID":[1,1,1,2,2,2],
"Date":["29.12.2020","05.01.2021","15.02.2021","11.04.2021","27.05.2021","29.05.2021"],
"Amount":[6,5,7,9,8,7]})
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
df = df.set_index("Date")
new_df = pd.DataFrame()
for id_val, obs_period in df.groupby("ID"):
date_range = pd.date_range(min(obs_period.index).replace(day=1), max(obs_period.index))
obs_period = obs_period.reindex(date_range, fill_value=pd.NA)
obs_period["ID"] = id_val
if pd.isna(obs_period.at[obs_period.index[0], "Amount"]):
obs_period.at[obs_period.index[0], "Amount"] = 0 # adding 0 at the beginning of the period if undefined
obs_period= obs_period.ffill() # filling Amount with last value
new_df = pd.concat([new_df, obs_period])
print(new_df)
BTW you should specify your date format while converting df["Date"]
Output:
ID Amount
2020-12-01 1 0.0
2020-12-02 1 0.0
2020-12-03 1 0.0
2020-12-04 1 0.0
2020-12-05 1 0.0
... .. ...
2021-05-25 2 9.0
2021-05-26 2 9.0
2021-05-27 2 8.0
2021-05-28 2 8.0
2021-05-29 2 7.0
[136 rows x 2 columns]