I have a pandas dataframe with 1.7 million rows, like this:
ID | date | value |
---|---|---|
10 | 2022-01-01 | 100 |
10 | 2022-01-02 | 150 |
10 | 2022-01-05 | 200 |
10 | 2022-01-07 | 150 |
10 | 2022-01-12 | 100 |
23 | 2022-02-01 | 490 |
23 | 2022-02-03 | 350 |
23 | 2022-02-04 | 333 |
23 | 2022-02-08 | 211 |
23 | 2022-02-09 | 100 |
I would like to insert the missing dates in the date column, like this:
ID | date | value |
---|---|---|
10 | 2022-01-01 | 100 |
10 | 2022-01-02 | 150 |
10 | 2022-01-03 | 0 |
10 | 2022-01-04 | 0 |
10 | 2022-01-05 | 200 |
10 | 2022-01-06 | 0 |
10 | 2022-01-07 | 150 |
10 | 2022-01-08 | 0 |
10 | 2022-01-09 | 0 |
10 | 2022-01-10 | 0 |
10 | 2022-01-11 | 0 |
10 | 2022-01-12 | 100 |
23 | 2022-02-01 | 490 |
23 | 2022-02-02 | 0 |
23 | 2022-02-03 | 350 |
23 | 2022-02-04 | 333 |
23 | 2022-02-05 | 0 |
23 | 2022-02-06 | 0 |
23 | 2022-02-07 | 0 |
23 | 2022-02-08 | 211 |
23 | 2022-02-09 | 100 |
I used:
s = pd.MultiIndex.from_tuples(
    [(x, d)
     for x, y in df.groupby("ID")["date"]
     for d in pd.date_range(min(y), max(y), freq="D")],
    names=["ID", "date"])
print(df.set_index(["ID", "date"]).reindex(s, fill_value=0).reset_index())
But it took too long. Is there a more performant way to do this?
CodePudding user response:
Use asfreq and fillna:
# convert to datetime if needed
df["date"] = pd.to_datetime(df["date"])
# insert the missing days, set the new rows' value to 0 and forward-fill the ID
df = df.set_index("date").asfreq("D").fillna({"value": "0"}).ffill().reset_index()
>>> df
date ID value
0 2022-01-01 10.0 100.0
1 2022-01-02 10.0 150.0
2 2022-01-03 10.0 0
3 2022-01-04 10.0 0
4 2022-01-05 10.0 200.0
5 2022-01-06 10.0 0
6 2022-01-07 10.0 150.0
7 2022-01-08 10.0 0
8 2022-01-09 10.0 0
9 2022-01-10 10.0 0
10 2022-01-11 10.0 0
11 2022-01-12 10.0 100.0
12 2022-01-13 10.0 0
13 2022-01-14 10.0 0
14 2022-01-15 10.0 0
15 2022-01-16 10.0 0
16 2022-01-17 10.0 0
17 2022-01-18 10.0 0
18 2022-01-19 10.0 0
19 2022-01-20 10.0 0
20 2022-01-21 10.0 0
21 2022-01-22 10.0 0
22 2022-01-23 10.0 0
23 2022-01-24 10.0 0
24 2022-01-25 10.0 0
25 2022-01-26 10.0 0
26 2022-01-27 10.0 0
27 2022-01-28 10.0 0
28 2022-01-29 10.0 0
29 2022-01-30 10.0 0
30 2022-01-31 10.0 0
31 2022-02-01 23.0 490.0
32 2022-02-02 23.0 0
33 2022-02-03 23.0 350.0
34 2022-02-04 23.0 333.0
35 2022-02-05 23.0 0
36 2022-02-06 23.0 0
37 2022-02-07 23.0 0
38 2022-02-08 23.0 211.0
39 2022-02-09 23.0 100.0
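Note that asfreq on the whole frame fills one continuous daily range, which is why ID 10 above gets rows up to 2022-01-31. If each ID should only cover its own first-to-last date, as in the desired output, applying asfreq per group is one option; a minimal sketch, assuming the columns are named ID, date and value as in the question:
import pandas as pd

df["date"] = pd.to_datetime(df["date"])

# asfreq each ID's slice separately so every group keeps its own date span
out = (df.set_index("date")
         .groupby("ID")["value"]
         .apply(lambda s: s.asfreq("D", fill_value=0))  # reindex to daily, missing days -> 0
         .reset_index())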
CodePudding user response:
You can try:
df['date'] = pd.to_datetime(df['date'])

# build each ID's full daily range, then merge the original rows back on
df = (df.groupby('ID')['date']
        .apply(lambda d: pd.date_range(start=d.min(), end=d.max()).to_list())
        .explode().reset_index()
        .merge(df, on=['ID', 'date'], how='left'))
df['value'] = df['value'].fillna(0).astype(int)
Output:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
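If this is still slow on 1.7 million rows, the groupby().apply(...).explode() step is usually the bottleneck, since it builds a Python list per group. A possible alternative, sketched here rather than taken from either answer above and assuming the same ID, date and value column names, is to build the full (ID, date) skeleton with NumPy and do a single merge:
import numpy as np
import pandas as pd

df["date"] = pd.to_datetime(df["date"])

# first/last date and number of days per ID
bounds = df.groupby("ID")["date"].agg(["min", "max"])
n_days = (bounds["max"] - bounds["min"]).dt.days + 1

# repeat every ID and its start date once per day in its range
ids = np.repeat(bounds.index.to_numpy(), n_days)
starts = np.repeat(bounds["min"].to_numpy(), n_days)
offsets = np.concatenate([np.arange(n) for n in n_days]).astype("timedelta64[D]")

skeleton = pd.DataFrame({"ID": ids, "date": starts + offsets})
out = skeleton.merge(df, on=["ID", "date"], how="left")
out["value"] = out["value"].fillna(0).astype(int)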