I have a dataset that has an ID value, month, and sales.
Example: ["2020-01-01", 1, 2.0], ["2020-02-01", 1, 3.0], etc...
I have a few hundred thousand different IDs and only the months ranging from "2019-12-01"-"2020-04-01" and "2021-12-01"-"2021-04-01".
I am trying to impute 0.0 for all IDs that is missing a month between 2019-12-01 to 2021-04-01.
My attempted code is:
months = ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-09-01", "2020-10-01", "2020-11-01", "2020-12-01",
"2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01", "2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01", "2021-10-01", "2021-11-01"]
lst = []
for id_ in df_1["ID"].unique().tolist():
for month in months:
if ((df_2['ID'] == id_) & (df_2['month_year'] == month)).any():
pass
else:
lst.append([id_, month, 0])
But this code is way too inefficient.
Any ideas how to speed this up?
For reference, df_2 is all the unique IDs and month combinations.
df_1:
ID month_year Sales_Qty
11/2/2200001340 2020-03-01 4.0
11/2/2200001340 2020-04-01 63.0
11/2/2200001340 2021-02-01 22.0
11/2/2200001973 2020-02-01 2.0
CodePudding user response:
Try this
f = lambda x : x.reindex(months).fillna(0)
df_out = df_1.set_index('month_year').groupby('ID')['Sales_Qty'].apply(f).reset_index()
Grouping by id and using the unique list of months as an index should be more efficient.