Home > Enterprise >  Pandas Insert missing dates values with mutiples IDs
Pandas Insert missing dates values with mutiples IDs

Time:04-29

I have a pandas dataframe, with 1.7 million of rows. Like this:

ID date value
10 2022-01-01 100
10 2022-01-02 150
10 2022-01-05 200
10 2022-01-07 150
10 2022-01-12 100
23 2022-02-01 490
23 2022-02-03 350
23 2022-02-04 333
23 2022-02-08 211
23 2022-02-09 100

I would like to insert the missing dates in the column date. Like this:

ID date value
10 2022-01-01 100
10 2022-01-02 150
10 2022-01-03 0
10 2022-01-04 0
10 2022-01-05 200
10 2022-01-06 0
10 2022-01-07 150
10 2022-01-08 0
10 2022-01-09 0
10 2022-01-10 0
10 2022-01-11 0
10 2022-01-12 100
23 2022-02-01 490
10 2022-02-02 0
23 2022-02-03 350
23 2022-02-04 333
´´ 10 2022-02-05
10 2022-02-06 0
10 2022-02-07 0
23 2022-02-08 211
23 2022-02-09 100

I used:

s = (pd.MultiIndex.from_tuples([[x, d]
      for x, y in df.groupby("Id")["Dt"]
      for d in pd.date_range(min(y), max(df["Dt"]), freq="MS")], names=["Id", "Dt"]))

print (df.set_index(["Id", "Dt"]).reindex(s, fill_value=0).reset_index())

But, It took too long. Is there a more performative way to do this?

CodePudding user response:

Use asfreq and fillna:

#convert to datetime if needed
df["date"] = pd.to_datetime(df["date"])

df = df.set_index("date").asfreq("D").fillna({"value": "0"}).ffill().reset_index()

>>> df
         date    ID  value
0  2022-01-01  10.0  100.0
1  2022-01-02  10.0  150.0
2  2022-01-03  10.0      0
3  2022-01-04  10.0      0
4  2022-01-05  10.0  200.0
5  2022-01-06  10.0      0
6  2022-01-07  10.0  150.0
7  2022-01-08  10.0      0
8  2022-01-09  10.0      0
9  2022-01-10  10.0      0
10 2022-01-11  10.0      0
11 2022-01-12  10.0  100.0
12 2022-01-13  10.0      0
13 2022-01-14  10.0      0
14 2022-01-15  10.0      0
15 2022-01-16  10.0      0
16 2022-01-17  10.0      0
17 2022-01-18  10.0      0
18 2022-01-19  10.0      0
19 2022-01-20  10.0      0
20 2022-01-21  10.0      0
21 2022-01-22  10.0      0
22 2022-01-23  10.0      0
23 2022-01-24  10.0      0
24 2022-01-25  10.0      0
25 2022-01-26  10.0      0
26 2022-01-27  10.0      0
27 2022-01-28  10.0      0
28 2022-01-29  10.0      0
29 2022-01-30  10.0      0
30 2022-01-31  10.0      0
31 2022-02-01  23.0  490.0
32 2022-02-02  23.0      0
33 2022-02-03  23.0  350.0
34 2022-02-04  23.0  333.0
35 2022-02-05  23.0      0
36 2022-02-06  23.0      0
37 2022-02-07  23.0      0
38 2022-02-08  23.0  211.0
39 2022-02-09  23.0  100.0

CodePudding user response:

You can try:

    df['date'] = pd.to_datetime(df['date'])
    df = (df.groupby('ID')['date'].apply(lambda d:
                               pd.date_range(start=d.min(),end=d.max()).to_list())
            .explode().reset_index()
            .merge(df, on=['ID','date'],how='left'))

    df['value'] = df['value'].fillna(0).astype(int)

Output:

    ID  date        value
0   10  2022-01-01  100
1   10  2022-01-02  150
2   10  2022-01-03  0
3   10  2022-01-04  0
4   10  2022-01-05  200
5   10  2022-01-06  0
6   10  2022-01-07  150
7   10  2022-01-08  0
8   10  2022-01-09  0
9   10  2022-01-10  0
10  10  2022-01-11  0
11  10  2022-01-12  100
12  23  2022-02-01  490
13  23  2022-02-02  0
14  23  2022-02-03  350
15  23  2022-02-04  333
16  23  2022-02-05  0
17  23  2022-02-06  0
18  23  2022-02-07  0
19  23  2022-02-08  211
20  23  2022-02-09  100
  • Related