Apologies for the vague title, I'm not entirely sure how to word it more correctly. I have a DataFrame like this:
date customerID saved purchased savedProduct purchasedProduct
0 2021-01-01 456789 1 0 11223344 [0]
1 2021-01-01 456789 1 0 55667788 [0]
2 2021-01-03 456789 0 1 0 [11223344, 28373827]
Which is created with this:
d = {'date': ['2021-01-01', '2021-01-01', '2021-01-03'], 'customerID': ['456789', '456789', '456789'], 'saved':[1, 1, 0], 'purchased': [0, 0, 1], 'savedProduct': [11223344, 55667788, 0], 'purchasedProduct': [[0], [0], [11223344, 28373827]]}
df = pd.DataFrame(data=d)
And the logic behind it is that each line is a customer record: they can only ever save one product at a time (which is why savedProduct has one product code) but they can purchase multiple products, which is why purchasedProduct contains a list. What I want to do is:
- By customerID, get unique productIDs in savedProduct
- By unique productID in this column, see if they appear in purchasedProduct
- If they appear, pull the date column from the line in which purchasedProduct appears so I can calculate the amount of days between savedProduct and purchasedProduct
So e.g., the product in line 1 appears in line 3 so preferably there'd be a way to have both the first line's date (2021-01-01) and third line's date (2021-01-03) in the same row so we can calculate difference between the dates.
I thought a nested loop would do the job but I can't get it to work (and there must be a more efficient way..):
dateDF = pd.DataFrame({'customerID': ['0'],
'savedDate': ['0'],
'purchasedDate': ['0']})
dateDF_t = pd.DataFrame()
sp = []
for x in df['customerID'].unique():
customerID = x
sp = df[df['customerID'] == x]['savedProduct'].unique()
for i in sp:
for idx, n in enumerate(df[df['customerID'] == x]['purchasedProduct']):
if i in n and i != 0:
print(df[df['customerID'] == x].iloc[idx, 1])
dateDF_t['customerID'] = df[df['customerID'] == x].iloc[idx, 1]
dateDF_t['savedDate'] = df[(df['customerID'] == x) & (df['savedProduct'] == i)]['date']
dateDF_t['purchasedDate'] = df[df['customerID'] == x].iloc[idx, 0]
dateDF = pd.concat([dateDF, dateDF_t])
But the output is like this:
customerID savedDate purchasedDate
0 0 0 0
0 NaN 2021-01-01 2021-01-03
Is there any way to do this better and also, why is customerID producing NaNs? When I have the output (the print in the loop) it works fine
Thanks for any help!
EDIT - may have just figured it out using lists instead but if someone has a more efficient way, would still be appreciated!
sp = []
customerIDs = []
savedDates = []
purchasedDates = []
for x in df['customerID'].unique():
sp = df[df['customerID'] == x]['savedProduct'].unique()
for i in sp:
for idx, n in enumerate(df[df['customerID'] == x]['purchasedProduct']):
if i in n and i != 0:
customerIDs.append(df[df['customerID'] == x].iloc[idx, 1])
savedDates.append(df[(df['customerID'] == x) & (df['savedProduct'] == i)]['date'].values[0])
purchasedDates.append(df[df['customerID'] == x].iloc[idx, 0])
savedDF = pd.DataFrame({'customerID': customerIDs,
'savedDates': savedDates,
'purchasedDates': purchasedDates})
which has the following output:
customerID savedDates purchasedDates
456789 2021-01-01 2021-01-03
2727228 2021-02-05 2021-02-09
CodePudding user response:
Try:
df=df.explode('purchasedProduct').reset_index(drop=True)
df['purchase_date'] = df.groupby('customerID').apply(
lambda df: df.apply(
lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis=1))
This will first explode the rows with lists in purchasedProducts, so it creates a seperate row for each item in the list. Then it adds a purchase date column, so you can determine on row level if and when the product is bought.
date customerID saved purchased savedProduct purchasedProduct purchase_date
2021-01-01 456789 1 0 11223344 0 2021-01-03
2021-01-01 456789 1 0 55667788 0 NaN
2021-01-03 456789 0 1 0 11223344 NaN
2021-01-03 456789 0 1 0 28373827 NaN
Of course you can filter the df to only have rows with saved products:
df.loc[df.saved==1]
date customerID saved purchased savedProduct purchasedProduct purchase_date
2021-01-01 456789 1 0 11223344 0 2021-01-03
2021-01-01 456789 1 0 55667788 0 NaN
Or with only certain columns:
df.loc[df.saved==1, ['customerID', 'savedProduct', 'date',`'purchase_date']]
customerID savedProduct date purchase_date
456789 11223344 2021-01-01 2021-01-03
456789 55667788 2021-01-01 NaN