Looping through a filtered dataframe to see if a value is in a list column-CodePudding

Apologies for the vague title, I'm not entirely sure how to word it more correctly. I have a DataFrame like this:

    date     customerID saved purchased savedProduct    purchasedProduct
0   2021-01-01  456789    1       0       11223344            [0]
1   2021-01-01  456789    1       0       55667788            [0]
2   2021-01-03  456789    0       1          0       [11223344, 28373827]

Which is created with this:

d = {'date': ['2021-01-01', '2021-01-01', '2021-01-03'], 'customerID': ['456789', '456789', '456789'], 'saved':[1, 1, 0], 'purchased': [0, 0, 1], 'savedProduct': [11223344, 55667788, 0], 'purchasedProduct': [[0], [0], [11223344, 28373827]]}

df = pd.DataFrame(data=d)

And the logic behind it is that each line is a customer record: they can only ever save one product at a time (which is why savedProduct has one product code) but they can purchase multiple products, which is why purchasedProduct contains a list. What I want to do is:

By customerID, get unique productIDs in savedProduct
By unique productID in this column, see if they appear in purchasedProduct
If they appear, pull the date column from the line in which purchasedProduct appears so I can calculate the amount of days between savedProduct and purchasedProduct

So e.g., the product in line 1 appears in line 3 so preferably there'd be a way to have both the first line's date (2021-01-01) and third line's date (2021-01-03) in the same row so we can calculate difference between the dates.

I thought a nested loop would do the job but I can't get it to work (and there must be a more efficient way..):

    dateDF = pd.DataFrame({'customerID': ['0'],
                          'savedDate': ['0'],
                          'purchasedDate': ['0']})
    
    dateDF_t = pd.DataFrame()
    
    sp = []
    for x in df['customerID'].unique():
      customerID = x
      sp = df[df['customerID'] == x]['savedProduct'].unique()
      for i in sp:
        for idx, n in enumerate(df[df['customerID'] == x]['purchasedProduct']):
          if i in n and i != 0:
            print(df[df['customerID'] == x].iloc[idx, 1])
            dateDF_t['customerID'] = df[df['customerID'] == x].iloc[idx, 1]
            dateDF_t['savedDate'] = df[(df['customerID'] == x) & (df['savedProduct'] == i)]['date']
            dateDF_t['purchasedDate'] = df[df['customerID'] == x].iloc[idx, 0]
            dateDF = pd.concat([dateDF, dateDF_t])

But the output is like this:

customerID  savedDate   purchasedDate
0   0          0             0
0   NaN      2021-01-01   2021-01-03

Is there any way to do this better and also, why is customerID producing NaNs? When I have the output (the print in the loop) it works fine

Thanks for any help!

EDIT - may have just figured it out using lists instead but if someone has a more efficient way, would still be appreciated!

sp = []
customerIDs = []
savedDates = []
purchasedDates = []
for x in df['customerID'].unique():
  sp = df[df['customerID'] == x]['savedProduct'].unique()
  for i in sp:
    for idx, n in enumerate(df[df['customerID'] == x]['purchasedProduct']):
      if i in n and i != 0:
        customerIDs.append(df[df['customerID'] == x].iloc[idx, 1])
        savedDates.append(df[(df['customerID'] == x) & (df['savedProduct'] == i)]['date'].values[0])
        purchasedDates.append(df[df['customerID'] == x].iloc[idx, 0])
  savedDF = pd.DataFrame({'customerID': customerIDs,
             'savedDates': savedDates,
             'purchasedDates': purchasedDates})

which has the following output:

customerID  savedDates  purchasedDates
  456789    2021-01-01  2021-01-03
  2727228   2021-02-05  2021-02-09

CodePudding user response：

Try:

df=df.explode('purchasedProduct').reset_index(drop=True)
df['purchase_date'] = df.groupby('customerID').apply(
    lambda df: df.apply(
        lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis=1))

This will first explode the rows with lists in purchasedProducts, so it creates a seperate row for each item in the list. Then it adds a purchase date column, so you can determine on row level if and when the product is bought.

date        customerID  saved   purchased   savedProduct    purchasedProduct    purchase_date
2021-01-01  456789      1       0           11223344        0                   2021-01-03
2021-01-01  456789      1       0           55667788        0                   NaN
2021-01-03  456789      0       1           0               11223344            NaN
2021-01-03  456789      0       1           0               28373827            NaN

Of course you can filter the df to only have rows with saved products:

df.loc[df.saved==1]

date        customerID  saved   purchased   savedProduct    purchasedProduct    purchase_date
2021-01-01  456789      1       0           11223344        0                   2021-01-03
2021-01-01  456789      1       0           55667788        0                   NaN

Or with only certain columns:

df.loc[df.saved==1, ['customerID', 'savedProduct', 'date',`'purchase_date']]

customerID  savedProduct    date        purchase_date
456789      11223344        2021-01-01  2021-01-03
456789      55667788        2021-01-01  NaN