I'm trying to append zeroes to product id's data which are not present for any particular date and my code takes lot of time to append zeroes. Looking for an alternate approach in pandas/numpy.
Here is the sample data:
rpt_date product_id total_views total_cart_adds total_order_unit total_gmv
30-07-2022 mp000000006243574 7 1 0 0
30-07-2022 mp000000006292285 1 0 0 0
30-07-2022 mp000000006294016 18 1 0 0
31-07-2022 mp000000006243574 8 2 0 0
31-07-2022 mp000000006292285 5 0 0 0
For Eg if data for product id 'mp000000006294016' is not present on 31 or any other day then it should append 0 for the respective columns
Below is my code
df_ans=pd.DataFrame()
for x in prod_df['product_id']:
df2=pd.DataFrame()
print(prod_df[prod_df['product_id']==x])
df2 = prod_df[prod_df['product_id']==x]
if df2.shape[0] == 1:
formatted_date1 = time.strptime(prod_df['rpt_date'][0], "%d-%m-%Y")
formatted_date2 = time.strptime('30-07-2022', "%d-%m-%Y")
if formatted_date1==formatted_date2:
df2.loc[-1] = ['31-07-2022', x, '0', '0','0','0'] # adding a row
df2.index = df2.index 1 # shifting index
df2 = df2.sort_index()
else:
df2.loc[-1] = ['30-07-2022', x, '0', '0','0','0'] # adding a row
df2.index = df2.index 1 # shifting index
df2 = df2.sort_index()
print(df2)
df_ans= pd.concat([df_ans,df2])
print("***************************************")
CodePudding user response:
You should almost certainly be solving this problem in the query that generated the data to begin with. But one way to work around it is to create a "blank" dataFrame with all the rpt_date
and product_id
combinations, then concat
that to the original data and drop all the duplicate rows:
df_prid = prod_df[['product_id']].drop_duplicates()
df_date = prod_df[['rpt_date']].drop_duplicates()
df_blank = df_date.merge(df_prid, how='cross')
df_blank[['total_views', 'total_cart_adds', 'total_order_unit', 'total_gmv']] = [0,0,0,0]
df_final = pd.concat([prod_df, df_blank]).drop_duplicates(subset=['rpt_date', 'product_id'], keep='first').reset_index()
Output:
index rpt_date product_id total_views total_cart_adds total_order_unit total_gmv
0 0 30-07-2022 mp000000006243574 7 1 0 0
1 1 30-07-2022 mp000000006292285 1 0 0 0
2 2 30-07-2022 mp000000006294016 18 1 0 0
3 3 31-07-2022 mp000000006243574 8 2 0 0
4 4 31-07-2022 mp000000006292285 5 0 0 0
5 5 31-07-2022 mp000000006294016 0 0 0 0
CodePudding user response:
First convert values to datetimes for correct ordering, create MultiIndex
and add missing categories by combinations all rpt_date
and product_id
by DataFrame.reindex
, last convert to original format in Series.dt.strftime
:
df['rpt_date'] = pd.to_datetime(df['rpt_date'], dayfirst=True)
mux = pd.MultiIndex.from_product([df['rpt_date'].unique(), df['product_id'].unique()],
names=['rpt_date','product_id'])
df = df.set_index(['rpt_date','product_id']).reindex(mux, fill_value=0).reset_index()
df['rpt_date'] = df['rpt_date'].dt.strftime("%d-%m-%Y")
print (df)
rpt_date product_id total_views total_cart_adds \
0 30-07-2022 mp000000006243574 7 1
1 30-07-2022 mp000000006292285 1 0
2 30-07-2022 mp000000006294016 18 1
3 31-07-2022 mp000000006243574 8 2
4 31-07-2022 mp000000006292285 5 0
5 31-07-2022 mp000000006294016 0 0
total_order_unit total_gmv
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0