Home > Enterprise >  Imputing missing rows more efficiently
Imputing missing rows more efficiently

Time:02-22

I have a dataset that has an ID value, month, and sales.

Example: ["2020-01-01", 1, 2.0], ["2020-02-01", 1, 3.0], etc...

I have a few hundred thousand different IDs and only the months ranging from "2019-12-01"-"2020-04-01" and "2021-12-01"-"2021-04-01".

I am trying to impute 0.0 for all IDs that is missing a month between 2019-12-01 to 2021-04-01.

My attempted code is:

months = ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-09-01", "2020-10-01", "2020-11-01", "2020-12-01",
                   "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01", "2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01", "2021-10-01", "2021-11-01"]

lst = []
for id_ in df_1["ID"].unique().tolist():
  for month in months:
    if ((df_2['ID'] == id_) & (df_2['month_year'] == month)).any():
      pass
    else:
      lst.append([id_, month, 0])

But this code is way too inefficient.

Any ideas how to speed this up?

For reference, df_2 is all the unique IDs and month combinations.

df_1:

          ID    month_year  Sales_Qty
11/2/2200001340 2020-03-01  4.0
11/2/2200001340 2020-04-01  63.0
11/2/2200001340 2021-02-01  22.0
11/2/2200001973 2020-02-01  2.0

CodePudding user response:

Try this

f = lambda x : x.reindex(months).fillna(0)
df_out = df_1.set_index('month_year').groupby('ID')['Sales_Qty'].apply(f).reset_index()

Grouping by id and using the unique list of months as an index should be more efficient.

  • Related