I have a data frame which looks like this:
data = {
'user_id': [
'9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX',
'9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX'
],
'timestamp': [
1612139269, 1612139665, 1612139579,
1612141096, 1612143046, 1612143729
],
'type': [
'productDetails', 'productDetails', 'checkout:confirmation',
'checkout:confirmation', 'productList', 'checkout:confirmation'
],
'session': [0,1,2,3,4,5],
'count_session_products': [4, 1, 0, 4, 2, 2],
'loyalty' : [0,0,0,0,0,0]
}
test_df = pd.DataFrame(data)
test_df
user_id | timestamp | type | session | prods | loyalty |
---|---|---|---|---|---|
9EPWZVMNP6D6KWX | 1612139269 | productDetails | 0 | 4 | 0 |
9EPWZVMNP6D6KWX | 1612139665 | productDetails | 1 | 1 | 0 |
9EPWZVMNP6D6KWX | 1612139579 | checkout:confirmation | 2 | 0 | 0 |
9EPWZVMNP6D6KWX | 1612141096 | checkout:confirmation | 3 | 4 | 0 |
9EPWZVMNP6D6KWX | 1612143046 | productList | 4 | 2 | 0 |
9EPWZVMNP6D6KWX | 1612143729 | checkout:confirmation | 5 | 2 | 0 |
Now I want to create loyalty labels and their conditions such as:
first_time_visitor
- any user with session = 0frequent_visitor
- any user with session > 0 and count_session_products > 0first_time_customer
- first time checkout:confirmation appears in the type columnrepeat_customer
- second time of checkout:confirmation appears in the type columnloyal_customer
- third time of checkout:confirmation appears in the type column
I have already have the conditions for first_time_visitor
and frequent_visitor
but I am having trouble creating first_time_customer
, repeat_customer
and loyal_customer
labels.
Conditions for first_time_visitor
and frequent_visitor
are as follows:
test_df['loyalty'] = np.where((test_df['session'] > 0) & ((test_df['type'] != 'checkout:confirmation')), 'frequent_visitor', None)
test_df.loc[test_df['session'] == 0, 'loyalty'] = 'first_time_visitor'
which gives me a dataframe looking like this:
user_id | timestamp | type | session | prods | loyalty |
---|---|---|---|---|---|
9EPWZVMNP6D6KWX | 1612139269 | productDetails | 0 | 4 | first_time_visitor |
9EPWZVMNP6D6KWX | 1612139665 | productDetails | 1 | 1 | frequent_visitor |
9EPWZVMNP6D6KWX | 1612139579 | checkout:confirmation | 2 | 0 | 0 |
9EPWZVMNP6D6KWX | 1612141096 | checkout:confirmation | 3 | 4 | 0 |
9EPWZVMNP6D6KWX | 1612143046 | productList | 4 | 2 | frequent_visitor |
9EPWZVMNP6D6KWX | 1612143729 | checkout:confirmation | 5 | 2 | 0 |
I have had a couple of ideas only, one being to use first_valid_index()
or argmax()
to find the index and somehow use that in a condition to create the first_time_customer
label. But I am not sure how to implement these conditions.
(test_df.type.values == 'checkout:confirmation').argmax()
test_df[test_df.type == 'checkout:confirmation'].first_valid_index()
In the end, I would expect my loyalty column to look like this:
user_id | timestamp | type | session | prods | loyalty |
---|---|---|---|---|---|
9EPWZVMNP6D6KWX | 1612139269 | productDetails | 0 | 4 | first_time_visitor |
9EPWZVMNP6D6KWX | 1612139665 | productDetails | 1 | 1 | frequent_visitor |
9EPWZVMNP6D6KWX | 1612139579 | checkout:confirmation | 2 | 0 | first_time_customer |
9EPWZVMNP6D6KWX | 1612141096 | checkout:confirmation | 3 | 4 | repeat_customer |
9EPWZVMNP6D6KWX | 1612143046 | productList | 4 | 2 | frequent_visitor |
9EPWZVMNP6D6KWX | 1612143729 | checkout:confirmation | 5 | 2 | loyal_customer |
Any suggestions and help would be appreciated. Thank you!
CodePudding user response:
One way to do that would be:
mask = test_df.type.eq('checkout:confirmation')
mask = mask & mask.cumsum().le(3)
test_df.loc[mask, 'loyalty'] = (
['first_time_customer', 'repeat_customer', 'loyal_customer'][:mask.sum()]
)
CodePudding user response:
edit: my original post had the first loop on n
in reverse, I don't think we need that or it helps... i also updated it so that we don't count previous occurrences of checkout:confirmation but instead just add 1 to the last checkout:confirmation count, so we are able to skip running through as many lines.
I think I was able to get a solution that uses a double-reverse loop over the dataframe. Probably not the most efficient thing in the world, but should work...
It uses the idea that each checkout seems to move the user up a level, and so we use integers and just add 1 to the loyalty before mapping those integers to words. Since loyalty-level = 3 is the highest, we are able to bail out of the loop early if we get to that level 3.
This doesn't do anything with the visitor status, so you'll want to make sure that your two methods work together for that.
df['l'] = 0 # makes a new temp column in the df to store our loyalty calcs
for n in range(len(df)): # start to loop through the dataframe
l = 0 # set the loyalty level to 0
if df['type'][n] == 'checkout:confirmation': # only do this for checkouts
l = 1 # if it is a checkout, the loyalty is now 1 (might go up)
for i in reversed(range(n)): # start looking at the rows above the current one for more checkouts
if df['user_id'][n] == df['user_id'][i] and df['type'][i] == 'checkout:confirmation': # make sure the userid matches as well as the above row is checkout
l = min(df['l'][i] 1,3) # if our new loyalty would be > 3, just set it to 3, which is the max
break
df['l'][n] = l # set the loyalty level in row n to l
df['loyalty'] = df['l'].map({1:'first_time_customer',2:'frequent_customer',3:'loyal_customer'}) # maps the integers to text for the loyalty. gives NaN for the non-purchase rows.