Creating labels and updating a column based on multiple conditions-CodePudding

I have a data frame which looks like this:

data = {
    'user_id': [
        '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX',
        '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX'
    ],
    'timestamp': [
        1612139269, 1612139665, 1612139579,
        1612141096, 1612143046, 1612143729
    ],
    'type': [
        'productDetails', 'productDetails', 'checkout:confirmation',
        'checkout:confirmation', 'productList', 'checkout:confirmation'
    ],
    'session': [0,1,2,3,4,5],
    'count_session_products': [4, 1, 0, 4, 2, 2],
    'loyalty' : [0,0,0,0,0,0]
}
test_df = pd.DataFrame(data)
test_df

user_id	timestamp	type	session	prods
9EPWZVMNP6D6KWX	1612139269	productDetails	0	4
9EPWZVMNP6D6KWX	1612139665	productDetails	1	1
9EPWZVMNP6D6KWX	1612139579	checkout:confirmation	2	0
9EPWZVMNP6D6KWX	1612141096	checkout:confirmation	3	4
9EPWZVMNP6D6KWX	1612143046	productList	4	2
9EPWZVMNP6D6KWX	1612143729	checkout:confirmation	5	2

Now I want to create loyalty labels and their conditions such as:

first_time_visitor - any user with session = 0
frequent_visitor - any user with session > 0 and count_session_products > 0
first_time_customer - first time checkout:confirmation appears in the type column
repeat_customer - second time of checkout:confirmation appears in the type column
loyal_customer - third time of checkout:confirmation appears in the type column

I have already have the conditions for first_time_visitor and frequent_visitor but I am having trouble creating first_time_customer, repeat_customer and loyal_customer labels.

Conditions for first_time_visitor and frequent_visitor are as follows:

test_df['loyalty'] = np.where((test_df['session'] > 0) & ((test_df['type'] != 'checkout:confirmation')), 'frequent_visitor', None)
test_df.loc[test_df['session'] == 0, 'loyalty'] = 'first_time_visitor'

which gives me a dataframe looking like this:

user_id	timestamp	type	session	prods	loyalty
9EPWZVMNP6D6KWX	1612139269	productDetails	0	4	first_time_visitor
9EPWZVMNP6D6KWX	1612139665	productDetails	1	1	frequent_visitor
9EPWZVMNP6D6KWX	1612139579	checkout:confirmation	2	0	0
9EPWZVMNP6D6KWX	1612141096	checkout:confirmation	3	4	0
9EPWZVMNP6D6KWX	1612143046	productList	4	2	frequent_visitor
9EPWZVMNP6D6KWX	1612143729	checkout:confirmation	5	2	0

I have had a couple of ideas only, one being to use first_valid_index() or argmax() to find the index and somehow use that in a condition to create the first_time_customer label. But I am not sure how to implement these conditions.

(test_df.type.values == 'checkout:confirmation').argmax()
test_df[test_df.type == 'checkout:confirmation'].first_valid_index()

In the end, I would expect my loyalty column to look like this:

user_id	timestamp	type	session	prods	loyalty
9EPWZVMNP6D6KWX	1612139269	productDetails	0	4	first_time_visitor
9EPWZVMNP6D6KWX	1612139665	productDetails	1	1	frequent_visitor
9EPWZVMNP6D6KWX	1612139579	checkout:confirmation	2	0	first_time_customer
9EPWZVMNP6D6KWX	1612141096	checkout:confirmation	3	4	repeat_customer
9EPWZVMNP6D6KWX	1612143046	productList	4	2	frequent_visitor
9EPWZVMNP6D6KWX	1612143729	checkout:confirmation	5	2	loyal_customer

Any suggestions and help would be appreciated. Thank you!

CodePudding user response：

One way to do that would be:

mask = test_df.type.eq('checkout:confirmation')
mask = mask & mask.cumsum().le(3) 
test_df.loc[mask, 'loyalty'] = (
    ['first_time_customer', 'repeat_customer', 'loyal_customer'][:mask.sum()]
)

CodePudding user response：

edit: my original post had the first loop on n in reverse, I don't think we need that or it helps... i also updated it so that we don't count previous occurrences of checkout:confirmation but instead just add 1 to the last checkout:confirmation count, so we are able to skip running through as many lines.

I think I was able to get a solution that uses a double-reverse loop over the dataframe. Probably not the most efficient thing in the world, but should work...

It uses the idea that each checkout seems to move the user up a level, and so we use integers and just add 1 to the loyalty before mapping those integers to words. Since loyalty-level = 3 is the highest, we are able to bail out of the loop early if we get to that level 3.

This doesn't do anything with the visitor status, so you'll want to make sure that your two methods work together for that.

df['l'] = 0 # makes a new temp column in the df to store our loyalty calcs
for n in range(len(df)): # start to loop through the dataframe
    l = 0 # set the loyalty level to 0
    if df['type'][n] == 'checkout:confirmation': # only do this for checkouts
            l = 1 # if it is a checkout, the loyalty is now 1 (might go up)
            for i in reversed(range(n)): # start looking at the rows above the current one for more checkouts
                    if df['user_id'][n] == df['user_id'][i] and df['type'][i] == 'checkout:confirmation': # make sure the userid matches as well as the above row is checkout
                            l = min(df['l'][i] 1,3) # if our new loyalty would be > 3, just set it to 3, which is the max
                            break 
            df['l'][n] = l # set the loyalty level in row n to l

df['loyalty'] = df['l'].map({1:'first_time_customer',2:'frequent_customer',3:'loyal_customer'}) # maps the integers to text for the loyalty. gives NaN for the non-purchase rows.