Home > database >  Creating labels and updating a column based on multiple conditions
Creating labels and updating a column based on multiple conditions

Time:11-29

I have a data frame which looks like this:

data = {
    'user_id': [
        '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX',
        '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX'
    ],
    'timestamp': [
        1612139269, 1612139665, 1612139579,
        1612141096, 1612143046, 1612143729
    ],
    'type': [
        'productDetails', 'productDetails', 'checkout:confirmation',
        'checkout:confirmation', 'productList', 'checkout:confirmation'
    ],
    'session': [0,1,2,3,4,5],
    'count_session_products': [4, 1, 0, 4, 2, 2],
    'loyalty' : [0,0,0,0,0,0]
}
test_df = pd.DataFrame(data)
test_df
user_id timestamp type session prods loyalty
9EPWZVMNP6D6KWX 1612139269 productDetails 0 4 0
9EPWZVMNP6D6KWX 1612139665 productDetails 1 1 0
9EPWZVMNP6D6KWX 1612139579 checkout:confirmation 2 0 0
9EPWZVMNP6D6KWX 1612141096 checkout:confirmation 3 4 0
9EPWZVMNP6D6KWX 1612143046 productList 4 2 0
9EPWZVMNP6D6KWX 1612143729 checkout:confirmation 5 2 0

Now I want to create loyalty labels and their conditions such as:

  • first_time_visitor - any user with session = 0

  • frequent_visitor - any user with session > 0 and count_session_products > 0

  • first_time_customer - first time checkout:confirmation appears in the type column

  • repeat_customer - second time of checkout:confirmation appears in the type column

  • loyal_customer - third time of checkout:confirmation appears in the type column

I have already have the conditions for first_time_visitor and frequent_visitor but I am having trouble creating first_time_customer, repeat_customer and loyal_customer labels.

Conditions for first_time_visitor and frequent_visitor are as follows:

test_df['loyalty'] = np.where((test_df['session'] > 0) & ((test_df['type'] != 'checkout:confirmation')), 'frequent_visitor', None)
test_df.loc[test_df['session'] == 0, 'loyalty'] = 'first_time_visitor'

which gives me a dataframe looking like this:

user_id timestamp type session prods loyalty
9EPWZVMNP6D6KWX 1612139269 productDetails 0 4 first_time_visitor
9EPWZVMNP6D6KWX 1612139665 productDetails 1 1 frequent_visitor
9EPWZVMNP6D6KWX 1612139579 checkout:confirmation 2 0 0
9EPWZVMNP6D6KWX 1612141096 checkout:confirmation 3 4 0
9EPWZVMNP6D6KWX 1612143046 productList 4 2 frequent_visitor
9EPWZVMNP6D6KWX 1612143729 checkout:confirmation 5 2 0

I have had a couple of ideas only, one being to use first_valid_index() or argmax() to find the index and somehow use that in a condition to create the first_time_customer label. But I am not sure how to implement these conditions.

(test_df.type.values == 'checkout:confirmation').argmax()
test_df[test_df.type == 'checkout:confirmation'].first_valid_index()

In the end, I would expect my loyalty column to look like this:

user_id timestamp type session prods loyalty
9EPWZVMNP6D6KWX 1612139269 productDetails 0 4 first_time_visitor
9EPWZVMNP6D6KWX 1612139665 productDetails 1 1 frequent_visitor
9EPWZVMNP6D6KWX 1612139579 checkout:confirmation 2 0 first_time_customer
9EPWZVMNP6D6KWX 1612141096 checkout:confirmation 3 4 repeat_customer
9EPWZVMNP6D6KWX 1612143046 productList 4 2 frequent_visitor
9EPWZVMNP6D6KWX 1612143729 checkout:confirmation 5 2 loyal_customer

Any suggestions and help would be appreciated. Thank you!

CodePudding user response:

One way to do that would be:

mask = test_df.type.eq('checkout:confirmation')
mask = mask & mask.cumsum().le(3) 
test_df.loc[mask, 'loyalty'] = (
    ['first_time_customer', 'repeat_customer', 'loyal_customer'][:mask.sum()]
)

CodePudding user response:

edit: my original post had the first loop on n in reverse, I don't think we need that or it helps... i also updated it so that we don't count previous occurrences of checkout:confirmation but instead just add 1 to the last checkout:confirmation count, so we are able to skip running through as many lines.

I think I was able to get a solution that uses a double-reverse loop over the dataframe. Probably not the most efficient thing in the world, but should work...

It uses the idea that each checkout seems to move the user up a level, and so we use integers and just add 1 to the loyalty before mapping those integers to words. Since loyalty-level = 3 is the highest, we are able to bail out of the loop early if we get to that level 3.

This doesn't do anything with the visitor status, so you'll want to make sure that your two methods work together for that.

df['l'] = 0 # makes a new temp column in the df to store our loyalty calcs
for n in range(len(df)): # start to loop through the dataframe
    l = 0 # set the loyalty level to 0
    if df['type'][n] == 'checkout:confirmation': # only do this for checkouts
            l = 1 # if it is a checkout, the loyalty is now 1 (might go up)
            for i in reversed(range(n)): # start looking at the rows above the current one for more checkouts
                    if df['user_id'][n] == df['user_id'][i] and df['type'][i] == 'checkout:confirmation': # make sure the userid matches as well as the above row is checkout
                            l = min(df['l'][i] 1,3) # if our new loyalty would be > 3, just set it to 3, which is the max
                            break 
            df['l'][n] = l # set the loyalty level in row n to l

df['loyalty'] = df['l'].map({1:'first_time_customer',2:'frequent_customer',3:'loyal_customer'}) # maps the integers to text for the loyalty. gives NaN for the non-purchase rows.
  • Related