I have a dataframe df:
df=pd.read_csv('https://raw.githubusercontent.com/amanaroratc/hello-world/master/x_restock.csv')
df
I want to fill in the missing dates for each Product_ID with restocking_events=0. To start, I have created a date_range dataframe using
dfdate=pd.DataFrame({'Date':pd.date_range(simple.Date.min(), simple.Date.max())})
where simple is some master dataframe and the min and max dates are '2021-11-13' and '2021-11-30'.
CodePudding user response:
Use:
import pandas as pd

#added parse_dates for datetimes
df=pd.read_csv('https://raw.githubusercontent.com/amanaroratc/hello-world/master/x_restock.csv',
               parse_dates=['Date'])
The first solution adds the complete range of dates, from the minimal to the maximal date in the data, for every Product_ID by using DataFrame.reindex with MultiIndex.from_product:
#all combinations of Product_ID and dates; level names match the original columns
mux = pd.MultiIndex.from_product([df['Product_ID'].unique(),
                                  pd.date_range(df.Date.min(), df.Date.max())],
                                 names=['Product_ID','Date'])
df1 = df.set_index(['Product_ID','Date']).reindex(mux, fill_value=0).reset_index()
print (df1)
Product_ID Date restocking_events
0 1004746 2021-11-13 0
1 1004746 2021-11-14 0
2 1004746 2021-11-15 0
3 1004746 2021-11-16 1
4 1004746 2021-11-17 0
... ... ...
3379 976460 2021-11-26 1
3380 976460 2021-11-27 0
3381 976460 2021-11-28 0
3382 976460 2021-11-29 0
3383 976460 2021-11-30 0
[3384 rows x 3 columns]
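To make the reindex approach easier to check by hand, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical stand-ins for the CSV, not taken from it. Note that reindex needs each (Product_ID, Date) pair to appear only once; duplicated pairs would have to be aggregated first.

```python
import pandas as pd

# Hypothetical toy data standing in for x_restock.csv
toy = pd.DataFrame({
    "Product_ID": [1, 1, 2],
    "Date": pd.to_datetime(["2021-11-13", "2021-11-15", "2021-11-14"]),
    "restocking_events": [1, 2, 1],
})

# Every (Product_ID, Date) pair over the global min..max date range
mux = pd.MultiIndex.from_product(
    [toy["Product_ID"].unique(),
     pd.date_range(toy["Date"].min(), toy["Date"].max())],
    names=["Product_ID", "Date"],
)

# Rows missing from toy are created with restocking_events=0
filled = (
    toy.set_index(["Product_ID", "Date"])
       .reindex(mux, fill_value=0)
       .reset_index()
)
print(filled)
```

Each of the two products ends up with one row per day between 2021-11-13 and 2021-11-15, so the result has six rows.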
Another idea is to build a helper DataFrame with every combination of Product_ID and date, then left-merge the original data onto it:
from itertools import product
dfdate=pd.DataFrame(product(df['Product_ID'].unique(),
pd.date_range(df.Date.min(), df.Date.max())),
columns=['Product_ID','Date'])
print (dfdate)
Product_ID Date
0 1004746 2021-11-13
1 1004746 2021-11-14
2 1004746 2021-11-15
3 1004746 2021-11-16
4 1004746 2021-11-17
... ...
3379 976460 2021-11-26
3380 976460 2021-11-27
3381 976460 2021-11-28
3382 976460 2021-11-29
3383 976460 2021-11-30
[3384 rows x 2 columns]
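#new name so df is not overwritten; fillna gives floats, so cast back to integers
df3 = dfdate.merge(df, how='left').fillna({'restocking_events':0}).astype({'restocking_events':int})
print (df3)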
Product_ID Date restocking_events
0 1004746 2021-11-13 0
1 1004746 2021-11-14 0
2 1004746 2021-11-15 0
3 1004746 2021-11-16 1
4 1004746 2021-11-17 0
... ... ...
3379 976460 2021-11-26 1
3380 976460 2021-11-27 0
3381 976460 2021-11-28 0
3382 976460 2021-11-29 0
3383 976460 2021-11-30 0
[3384 rows x 3 columns]
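If the dates should come from a fixed range, such as the '2021-11-13' to '2021-11-30' range of the master dataframe mentioned in the question, the helper frame can also be built with a cross merge. This is only a sketch: the `toy` data is hypothetical, and `how="cross"` assumes pandas 1.2 or newer.

```python
import pandas as pd

# Hypothetical toy data standing in for x_restock.csv
toy = pd.DataFrame({
    "Product_ID": [1, 1, 2],
    "Date": pd.to_datetime(["2021-11-16", "2021-11-20", "2021-11-26"]),
    "restocking_events": [1, 2, 1],
})

# Fixed date range taken from the question (min and max of the master dataframe)
dates = pd.DataFrame({"Date": pd.date_range("2021-11-13", "2021-11-30")})
products = pd.DataFrame({"Product_ID": toy["Product_ID"].unique()})

# Cartesian product of products and dates
grid = products.merge(dates, how="cross")

# Left merge keeps every grid row; unmatched rows get NaN, which is replaced by 0
filled = grid.merge(toy, on=["Product_ID", "Date"], how="left")
filled["restocking_events"] = filled["restocking_events"].fillna(0).astype(int)
print(filled)
```

The range covers 18 days, so two products give 36 rows, and the total number of restocking events is unchanged.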
Or, if you need consecutive dates only between each product's own first and last date, use asfreq within each group:
#'D' is daily frequency; each group is filled only between its own min and max date
df2 = (df.set_index('Date')
         .groupby('Product_ID')['restocking_events']
         .apply(lambda x: x.asfreq('D', fill_value=0))
         .reset_index())
print (df2)
Product_ID Date restocking_events
0 112714 2021-11-15 1
1 112714 2021-11-16 1
2 112714 2021-11-17 0
3 112714 2021-11-18 1
4 112714 2021-11-19 0
... ... ...
2209 3630918 2021-11-25 0
2210 3630918 2021-11-26 0
2211 3630918 2021-11-27 0
2212 3630918 2021-11-28 0
2213 3630918 2021-11-29 1
[2214 rows x 3 columns]
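The row count is smaller here (2214 instead of 3384) because each product is filled only between its own first and last date. A small sketch with hypothetical data shows the effect:

```python
import pandas as pd

# Hypothetical toy data: product 1 spans 4 days, product 2 spans 2 days
toy = pd.DataFrame({
    "Product_ID": [1, 1, 2, 2],
    "Date": pd.to_datetime(["2021-11-13", "2021-11-16",
                            "2021-11-20", "2021-11-21"]),
    "restocking_events": [1, 1, 2, 1],
})

# asfreq runs once per product, so gaps are filled per group, not globally
per_group = (
    toy.set_index("Date")
       .groupby("Product_ID")["restocking_events"]
       .apply(lambda s: s.asfreq("D", fill_value=0))
       .reset_index()
)
print(per_group)
```

Product 1 gets four rows (2021-11-13 to 2021-11-16) and product 2 keeps its two rows, whereas a global range from 2021-11-13 to 2021-11-21 would give nine rows per product.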