Looping through a filtered dataframe to see if a value is in a list column


Apologies for the vague title; I'm not entirely sure how to word it more clearly. I have a DataFrame like this:

        date        customerID  saved  purchased  savedProduct  purchasedProduct
    0   2021-01-01  456789      1      0          11223344      [0]
    1   2021-01-01  456789      1      0          55667788      [0]
    2   2021-01-03  456789      0      1          0             [11223344, 28373827]

Which is created with this:

    import pandas as pd

    d = {'date': ['2021-01-01', '2021-01-01', '2021-01-03'],
         'customerID': ['456789', '456789', '456789'],
         'saved': [1, 1, 0],
         'purchased': [0, 0, 1],
         'savedProduct': [11223344, 55667788, 0],
         'purchasedProduct': [[0], [0], [11223344, 28373827]]}

    df = pd.DataFrame(data=d)

The logic behind it is that each line is a customer record: a customer can only ever save one product at a time (which is why savedProduct holds a single product code), but they can purchase multiple products at once (which is why purchasedProduct contains a list). What I want to do is:

  • By customerID, get unique productIDs in savedProduct
  • For each unique productID in that column, see whether it appears in purchasedProduct
  • If it does, pull the date from the line in which the purchase appears, so I can calculate the number of days between saving and purchasing

So e.g. the product saved in line 1 is purchased in line 3, so ideally there would be a way to get both the first line's date (2021-01-01) and the third line's date (2021-01-03) into the same row, so the difference between the dates can be calculated.
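(For the date difference itself I'd expect something like the sketch below, assuming the two dates end up in columns named savedDate and purchasedDate - the names and the single-row frame are just placeholders:)

    # hypothetical matched save/purchase pair in one row
    pairs = pd.DataFrame({'savedDate': ['2021-01-01'], 'purchasedDate': ['2021-01-03']})
    pairs['daysToPurchase'] = (
        pd.to_datetime(pairs['purchasedDate']) - pd.to_datetime(pairs['savedDate'])
    ).dt.days  # -> 2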

I thought a nested loop would do the job but I can't get it to work (and there must be a more efficient way..):

    dateDF = pd.DataFrame({'customerID': ['0'],
                          'savedDate': ['0'],
                          'purchasedDate': ['0']})
    
    dateDF_t = pd.DataFrame()
    
    sp = []
    for x in df['customerID'].unique():
      customerID = x
      sp = df[df['customerID'] == x]['savedProduct'].unique()
      for i in sp:
        for idx, n in enumerate(df[df['customerID'] == x]['purchasedProduct']):
          if i in n and i != 0:
            print(df[df['customerID'] == x].iloc[idx, 1])
            dateDF_t['customerID'] = df[df['customerID'] == x].iloc[idx, 1]
            dateDF_t['savedDate'] = df[(df['customerID'] == x) & (df['savedProduct'] == i)]['date']
            dateDF_t['purchasedDate'] = df[df['customerID'] == x].iloc[idx, 0]
            dateDF = pd.concat([dateDF, dateDF_t])

But the output is like this:

   customerID  savedDate   purchasedDate
0  0           0           0
0  NaN         2021-01-01  2021-01-03

Is there a better way to do this, and also, why is customerID producing NaNs? When I print the value inside the loop (the print in the code above) it comes out fine.

Thanks for any help!

EDIT - I may have just figured it out using lists instead, but if someone has a more efficient way it would still be appreciated!

    sp = []
    customerIDs = []
    savedDates = []
    purchasedDates = []
    for x in df['customerID'].unique():
      sp = df[df['customerID'] == x]['savedProduct'].unique()
      for i in sp:
        for idx, n in enumerate(df[df['customerID'] == x]['purchasedProduct']):
          if i in n and i != 0:
            customerIDs.append(df[df['customerID'] == x].iloc[idx, 1])
            savedDates.append(df[(df['customerID'] == x) & (df['savedProduct'] == i)]['date'].values[0])
            purchasedDates.append(df[df['customerID'] == x].iloc[idx, 0])

    savedDF = pd.DataFrame({'customerID': customerIDs,
                            'savedDates': savedDates,
                            'purchasedDates': purchasedDates})

which has the following output:

customerID  savedDates  purchasedDates
  456789    2021-01-01  2021-01-03
  2727228   2021-02-05  2021-02-09

CodePudding user response:

Try:

import numpy as np

df = df.explode('purchasedProduct').reset_index(drop=True)
df['purchase_date'] = df.groupby('customerID').apply(
    lambda df: df.apply(
        lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis=1))

This will first explode the rows with lists in purchasedProduct, so it creates a separate row for each item in the list. Then it adds a purchase_date column, so you can determine on row level if and when the product was bought.

date        customerID  saved   purchased   savedProduct    purchasedProduct    purchase_date
2021-01-01  456789      1       0           11223344        0                   2021-01-03
2021-01-01  456789      1       0           55667788        0                   NaN
2021-01-03  456789      0       1           0               11223344            NaN
2021-01-03  456789      0       1           0               28373827            NaN
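As a side note, the same matching could also be sketched as a merge after the explode, instead of the nested apply. This is only a rough alternative sketch starting from the exploded frame (before purchase_date is added); purchases and out are illustrative names:

    # rows that record a purchase, one product per row thanks to the explode above;
    # cast the exploded column back to int so the merge keys share a dtype
    purchases = (df.loc[df.purchased == 1, ['customerID', 'purchasedProduct', 'date']]
                   .astype({'purchasedProduct': 'int64'})
                   .rename(columns={'purchasedProduct': 'savedProduct',
                                    'date': 'purchase_date'}))

    # left-join each saved product onto a purchase of the same product by the same customer
    out = df.loc[df.saved == 1].merge(purchases, on=['customerID', 'savedProduct'], how='left')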

Of course you can filter the df to only have rows with saved products:

df.loc[df.saved==1]

date        customerID  saved   purchased   savedProduct    purchasedProduct    purchase_date
2021-01-01  456789      1       0           11223344        0                   2021-01-03
2021-01-01  456789      1       0           55667788        0                   NaN

Or with only certain columns:

df.loc[df.saved==1, ['customerID', 'savedProduct', 'date', 'purchase_date']]

customerID  savedProduct    date        purchase_date
456789      11223344        2021-01-01  2021-01-03
456789      55667788        2021-01-01  NaN
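And since the original goal was the number of days between saving and purchasing, that is just datetime arithmetic on the two date columns (assuming purchase_date holds plain date strings as in the output above; saved_rows and days_to_purchase are illustrative names):

    saved_rows = df.loc[df.saved == 1, ['customerID', 'savedProduct', 'date', 'purchase_date']].copy()
    saved_rows['days_to_purchase'] = (
        pd.to_datetime(saved_rows['purchase_date']) - pd.to_datetime(saved_rows['date'])
    ).dt.days  # NaN where the saved product was never purchased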