I have a DataFrame of email addresses with their domains, and a list of users 1-5:
users = [1, 2, 3, 4, 5]
I need to allocate each unique domain to a user id. Rows that share a domain must always be allocated to the same user id, but any domain can go to any user id as long as the domains are distributed somewhat evenly across the users.
My Dataframe:
email first_name last_name domain
0 [email protected] Herschel Krustofsky gmail.com
1 [email protected] Robert Terwilliger hotmail.com
2 [email protected] Homer Simpson email.com
3 [email protected] Bart Simpson gmail.com
4 [email protected] Moe Szyslak moestavern.com
5 [email protected] Marge Simpson simpson.net
6 [email protected] Lisa Simpson sax.com
7 [email protected] Itchy And hotmail.com
8 [email protected] Scratchy Show work.net
9 [email protected] Maggie Simpson hotmail.com
10 [email protected] Seymour Skinner teacher.net
My desired outcome:
email first_name last_name domain user_id
0 [email protected] Herschel Krustofsky gmail.com 1
1 [email protected] Robert Terwilliger hotmail.com 2
2 [email protected] Homer Simpson email.com 3
3 [email protected] Bart Simpson gmail.com 1
4 [email protected] Moe Szyslak moestavern.com 4
5 [email protected] Marge Simpson simpson.net 5
6 [email protected] Lisa Simpson sax.com 1
7 [email protected] Itchy And hotmail.com 2
8 [email protected] Scratchy Show work.net 3
9 [email protected] Maggie Simpson hotmail.com 2
10 [email protected] Seymour Skinner teacher.net 4
Is simply incrementing the user id the best approach? In my example, user 5 ends up with fewer rows than the other users.
CodePudding user response:
Firstly, to get the unique domains as a dataframe:
unique = pd.DataFrame(df['domain'].drop_duplicates().reset_index(drop=True))
domain
0 gmail.com
1 hotmail.com
2 email.com
3 moestavern.com
4 simpson.net
5 sax.com
6 work.net
7 teacher.net
Then, using numpy, you can cycle through the list of users so that each domain is assigned one of the 5 user ids:
IDs = np.array([1, 2, 3, 4, 5])
unique['user_id'] = np.resize(IDs, unique.shape[0])
domain user_id
0 gmail.com 1
1 hotmail.com 2
2 email.com 3
3 moestavern.com 4
4 simpson.net 5
5 sax.com 1
6 work.net 2
7 teacher.net 3
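To see why this works, here is a minimal sketch of what np.resize does on its own: when the requested size is larger than the input, it repeats the input from the start, which is what produces the 1, 2, 3, 4, 5, 1, 2, 3 pattern above. The variable name cycled is only for illustration.

```python
import numpy as np

ids = np.array([1, 2, 3, 4, 5])

# np.resize repeats the input from the start when the new size is larger
cycled = np.resize(ids, 8)
print(cycled.tolist())  # [1, 2, 3, 4, 5, 1, 2, 3]

# It truncates when the new size is smaller than the input
print(np.resize(ids, 3).tolist())  # [1, 2, 3]
```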
You can then merge on this to get the id for each row:
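# how='left' keeps every row of df; assign the result, since merge returns a new DataFrame
df = df.merge(unique, on='domain', how='left')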
or by building a dictionary and mapping each domain through it:
ids = dict(zip(unique['domain'], unique['user_id']))
df['user_id'] = df['domain'].map(ids)
email first_name last_name domain user_id
0 [email protected] Herschel Krustofsky gmail.com 1
1 [email protected] Robert Terwilliger hotmail.com 2
2 [email protected] Homer Simpson email.com 3
3 [email protected] Bart Simpson gmail.com 1
4 [email protected] Moe Szyslak moestavern.com 4
5 [email protected] Marge Simpson simpson.net 5
6 [email protected] Lisa Simpson sax.com 1
7 [email protected] Itchy And hotmail.com 2
8 [email protected] Scratchy Show work.net 2
9 [email protected] Maggie Simpson hotmail.com 2
10 [email protected] Seymour Skinner teacher.net 3
(This doesn't fully match your example: work.net and teacher.net get user ids 2 and 3 here rather than 3 and 4. Please tell me if I've missed something.)
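If it helps, the two requirements can be checked directly. The sketch below rebuilds only the domain column from the question (the other columns are not needed for the check), applies the same cyclic assignment, and then confirms that each domain has exactly one user id and counts the rows per user. The names one_user_per_domain and rows_per_user are just for illustration.

```python
import numpy as np
import pandas as pd

# Only the domain column from the question is needed for this check
df = pd.DataFrame({"domain": [
    "gmail.com", "hotmail.com", "email.com", "gmail.com", "moestavern.com",
    "simpson.net", "sax.com", "hotmail.com", "work.net", "hotmail.com",
    "teacher.net",
]})

# Same cyclic assignment as above, expressed as a domain -> user id dict
unique = df["domain"].drop_duplicates().reset_index(drop=True)
ids = dict(zip(unique, np.resize([1, 2, 3, 4, 5], len(unique))))
df["user_id"] = df["domain"].map(ids)

# True when every domain is tied to exactly one user id
one_user_per_domain = bool(df.groupby("domain")["user_id"].nunique().eq(1).all())

# Number of rows (emails) each user ends up with
rows_per_user = df["user_id"].value_counts().sort_index()

print(one_user_per_domain)
print(rows_per_user)
```

On this data the domains are spread evenly (two users get one domain, three get two), but the rows are not: user 2 ends up with 4 rows while users 4 and 5 get 1 each, because hotmail.com appears three times.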
Full code:
import numpy as np
import pandas as pd

# One row per distinct domain, in order of first appearance
unique = pd.DataFrame(df['domain'].drop_duplicates().reset_index(drop=True))
IDs = np.array([1, 2, 3, 4, 5])
# Cycle through the user ids until every domain has one
unique['user_id'] = np.resize(IDs, unique.shape[0])
ids = dict(zip(unique['domain'], unique['user_id']))
df['user_id'] = df['domain'].map(ids)
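If "evenly distributed" should apply to rows rather than to domains, one possible alternative (not what the code above does) is a greedy allocation: take the domains from most to least frequent and give each one to whichever user currently has the fewest rows. This is only a sketch under that assumption; the names load and assignment are hypothetical, and the DataFrame again contains just the domain column from the question.

```python
import pandas as pd

df = pd.DataFrame({"domain": [
    "gmail.com", "hotmail.com", "email.com", "gmail.com", "moestavern.com",
    "simpson.net", "sax.com", "hotmail.com", "work.net", "hotmail.com",
    "teacher.net",
]})
users = [1, 2, 3, 4, 5]

# Rows per domain, most frequent first (value_counts sorts descending)
counts = df["domain"].value_counts()

load = {u: 0 for u in users}  # rows assigned to each user so far
assignment = {}               # domain -> user id

for domain, n in counts.items():
    # Pick the user with the fewest rows so far; ties go to the lowest id
    user = min(users, key=lambda u: (load[u], u))
    assignment[domain] = user
    load[user] += int(n)

# Every row of a given domain gets the same user id
df["user_id"] = df["domain"].map(assignment)
print(load)
```

With the sample data, hotmail.com (3 rows) goes to user 1 and gmail.com (2 rows) to user 2, and the six single-row domains fill up the rest, giving row counts of 3, 2, 2, 2, 2 for users 1 to 5. Which single-row domain lands on which user depends on how value_counts orders ties, so treat the exact assignment as unspecified.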