I'm a newbie, so please excuse me if I use incorrect terms. I have a df with customer purchasing info and customers are identified by a unique user_id. Each item a user_id buys in each transaction creates a new row (if a customer buys 5 products in 1 transaction, 5 different rows are created with that products info).
I have created customer profiles based on 4 variables (income, age, dept id & parental status) using the loc function. It has worked, however, the outcome isn't what I want. There are 106,143 customers in the df and 30,964,564 rows. The profiles I created (young parent, young single adult, higher earner, over 60, other ['other' to catch anything not assigned one of the other profiles]) are being assigned to each row, rather than to each user_id e.g. user_id 1 buys 5 items, 1 of which matches the conditions of 'young parent', the rest are assigned 'other'. This is my code:
# create flag for 'High earner' (age: 40-59, income: 400000 , department_id: 1, 4, 7, 19, 16, parental_status: Parent)
final_df.loc[(final_df['age_range'].isin(['40-49', '50-59'])) & (final_df['income'] >= 400000) & (final_df['department_id'].isin([1, 4, 7, 19, 16])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Higher earner'
# create flag for 'Young single adult' (age: <39, income: <=199999, department_id: 1, 4, 7, 19, parental_status: Non-parent)
final_df.loc[(final_df['age'] <= 39) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([1, 4, 7, 19])) & (final_df['parental_status'] == 'Non-parent'), 'customer_profile'] = 'Young single adult'
# create flag for 'Young parent' (age: 20-39, income: <=199999, department_id: 4, 13, 16, 17, 18 parental_status: Parent)
final_df.loc[(final_df['age_range'].isin(['20-29', '30-39'])) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([4, 13, 16, 17, 18])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Young parent'
# create flag for 'Over 60' (age: 60 , income: <=199999, department_id: 1, 4, 11, 12, 15, 20 parental_status: Parent)
final_df.loc[(final_df['age'] >= 60) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([1, 4, 11, 12, 15, 20])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Over 60'
# impute all NaN values as 'Other'
final_df['customer_profile'].fillna('Other', inplace=True)
This is the result:
user_id customer_profile
0 1 Other
1 1 Other
2 1 Other
3 1 Other
4 1 Other
5 1 Other
6 1 Other
7 1 Other
8 1 Other
9 1 Other
10 1 Young parent
11 1 Other
12 1 Other
13 1 Other
14 1 Other
15 1 Other
16 1 Other
17 1 Other
18 1 Other
19 1 Other
20 1 Other
21 1 Other
22 1 Other
23 1 Young parent
24 1 Young parent
What I actually want is, "if 'young parent' (or any profile) is assigned even once to a user_id, then all 'other' for that user_id must be changed to 'young parent' too" (a customer cannot have 2 profiles!). So, the above results should show 'young parent' in each row.
Is this possible? Am I using the wrong function? My knowledge is limited and any advice would be appreciated!
CodePudding user response:
Mask
the Other
values in the customer_profile
column, then group
the column by user_id
and transform
with first
to select the first non-nan value per user_id
m = df['customer_profile'].eq('Other')
df['customer_profile'] = df['customer_profile'].mask(m)\
.groupby(df['user_id']).transform('first')
To further simplify this you can skip the final step in your code where you are using fillna
to fill the Other
values because to use groupby we have to mask this values back to NaN
. So fillna
is a redundant step.
df['customer_profile'] = df.groupby('user_id')['customer_profile'].transform('first')
user_id customer_profile
0 1 Young parent
1 1 Young parent
2 1 Young parent
3 1 Young parent
4 1 Young parent
5 1 Young parent
6 1 Young parent
7 1 Young parent
...
23 1 Young parent
24 1 Young parent