Pandas allocation / delegation of unique values in a dataframe column to a user from a list

Time:09-01

I have a DataFrame of email addresses with their domains, and a list of users 1-5:

users = [1, 2, 3, 4, 5]

I need to allocate each unique domain to a user id. Rows that share a domain must always be allocated to the same user id, but any given domain can go to any user id as long as the domains are somewhat evenly distributed across the users.

My Dataframe:

   email                  first_name    last_name     domain          
0  [email protected]       Herschel      Krustofsky    gmail.com       
1  [email protected]        Robert        Terwilliger   hotmail.com     
2  [email protected]    Homer         Simpson       email.com       
3  [email protected]     Bart          Simpson       gmail.com       
4  [email protected]     Moe           Szyslak       moestavern.com   
5  [email protected]      Marge         Simpson       simpson.net     
6  [email protected]   Lisa          Simpson       sax.com         
7  [email protected]      Itchy         And           hotmail.com     
8  [email protected]      Scratchy      Show          work.net        
9  [email protected]     Maggie        Simpson       hotmail.com     
10 [email protected]    Seymour       Skinner       teacher.net     

My desired outcome.

   email                  first_name    last_name     domain           user_id
0  [email protected]       Herschel      Krustofsky    gmail.com        1
1  [email protected]        Robert        Terwilliger   hotmail.com      2
2  [email protected]    Homer         Simpson       email.com        3
3  [email protected]     Bart          Simpson       gmail.com        1
4  [email protected]     Moe           Szyslak       moestavern.com   4
5  [email protected]      Marge         Simpson       simpson.net      5
6  [email protected]   Lisa          Simpson       sax.com          1
7  [email protected]      Itchy         And           hotmail.com      2
8  [email protected]      Scratchy      Show          work.net         3
9  [email protected]     Maggie        Simpson       hotmail.com      2
10 [email protected]    Seymour       Skinner       teacher.net      4

Is simply incrementing the user id the best approach? In my example, user 5 seems to get few rows in comparison.

CodePudding user response:

Firstly, to get the unique domains as a dataframe:

unique = pd.DataFrame(df['domain'].drop_duplicates().reset_index(drop=True))

           domain
0       gmail.com
1     hotmail.com
2       email.com
3  moestavern.com
4     simpson.net
5         sax.com
6        work.net
7     teacher.net

Then using numpy with a list of users, you can assign each domain one of the 5 users:

IDs = np.array([1, 2, 3, 4, 5])
unique['user_id'] = np.resize(IDs, unique.shape[0])

               domain  user_id
0       gmail.com        1
1     hotmail.com        2
2       email.com        3
3  moestavern.com        4
4     simpson.net        5
5         sax.com        1
6        work.net        2
7     teacher.net        3
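
As a possible alternative to building a separate frame, the same round-robin assignment can be sketched with pd.factorize, which numbers each distinct domain in order of first appearance. The sample frame below is a stand-in that keeps only the domain column from the question, and the users list is taken from the question; treat it as a sketch rather than the method used above.

```python
import pandas as pd

# Stand-in for the question's data: only the domain column matters here.
df = pd.DataFrame({"domain": [
    "gmail.com", "hotmail.com", "email.com", "gmail.com", "moestavern.com",
    "simpson.net", "sax.com", "hotmail.com", "work.net", "hotmail.com",
    "teacher.net",
]})
users = [1, 2, 3, 4, 5]

# factorize gives each distinct domain an integer code (0, 1, 2, ...)
# in order of first appearance; repeated domains share the same code.
codes, uniques = pd.factorize(df["domain"])

# Cycle through the user list by taking each code modulo the number of users.
df["user_id"] = [users[code % len(users)] for code in codes]
print(df)
```

Because the code is tied to the domain rather than the row, every row with the same domain receives the same user id.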

You can then merge on this to get the id for each row:

df.merge(unique, on='domain')

or using a dictionary with replace:

ids = {unique.loc[i, 'domain']:unique.loc[i, 'user_id'] for i in range(len(unique))}
df['user_id'] = df['domain'].replace(ids)


   email                  first_name    last_name     domain           user_id
0  [email protected]       Herschel      Krustofsky    gmail.com        1
1  [email protected]        Robert        Terwilliger   hotmail.com      2
2  [email protected]    Homer         Simpson       email.com        3
3  [email protected]     Bart          Simpson       gmail.com        1
4  [email protected]     Moe           Szyslak       moestavern.com   4
5  [email protected]      Marge         Simpson       simpson.net      5
6  [email protected]   Lisa          Simpson       sax.com          1
7  [email protected]      Itchy         And           hotmail.com      2
8  [email protected]      Scratchy      Show          work.net         2
9  [email protected]     Maggie        Simpson       hotmail.com      2
10 [email protected]    Seymour       Skinner       teacher.net      3
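
A variant of the lookup step, sketched here with Series.map instead of replace: map looks each domain up in the mapping and leaves NaN where a domain has no entry, which makes unassigned rows easy to find. The tiny frames below are hypothetical, and work.net is deliberately left out of the mapping to show that behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data; "work.net" is intentionally missing from `unique`.
df = pd.DataFrame({"domain": ["gmail.com", "hotmail.com", "gmail.com", "work.net"]})
unique = pd.DataFrame({"domain": ["gmail.com", "hotmail.com"]})
unique["user_id"] = np.resize(np.array([1, 2, 3, 4, 5]), unique.shape[0])

# Turn the lookup frame into a Series indexed by domain, then map each row.
lookup = unique.set_index("domain")["user_id"]
df["user_id"] = df["domain"].map(lookup)

# Rows whose domain had no entry in the lookup come back as NaN.
unassigned = df[df["user_id"].isna()]
print(unassigned)
```

Note that the NaN forces the user_id column to a float dtype here; when every domain is present in the lookup, the column stays integer.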

(This doesn't fully match your example, so please tell me if I've missed something).

Full code:

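import numpy as np
import pandas as pd
# One row per distinct domain, in order of first appearance
unique = pd.DataFrame(df['domain'].drop_duplicates().reset_index(drop=True))
# Cycle the user ids across the distinct domains
IDs = np.array([1, 2, 3, 4, 5])
unique['user_id'] = np.resize(IDs, unique.shape[0])
# Look up each row's user id by its domain
ids = {unique.loc[i, 'domain']:unique.loc[i, 'user_id'] for i in range(len(unique))}
df['user_id'] = df['domain'].replace(ids)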
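
The question also worries that plain incrementing leaves some users with fewer rows. One way to even this out, sketched below as an assumption rather than part of the answer above, is a greedy pass: take domains from most to least frequent and give each one to whichever user currently has the fewest rows. The sample frame is a stand-in with only the domain column, and the names heap, assignment and loads are illustrative.

```python
import heapq
import pandas as pd

# Stand-in for the question's data: only the domain column matters here.
df = pd.DataFrame({"domain": [
    "gmail.com", "hotmail.com", "email.com", "gmail.com", "moestavern.com",
    "simpson.net", "sax.com", "hotmail.com", "work.net", "hotmail.com",
    "teacher.net",
]})
users = [1, 2, 3, 4, 5]

# Row count per domain, largest first (value_counts sorts descending by default).
counts = df["domain"].value_counts()

# Min-heap of (rows assigned so far, user id); the least-loaded user is popped first.
heap = [(0, user) for user in users]
heapq.heapify(heap)

assignment = {}
for domain, n_rows in counts.items():
    load, user = heapq.heappop(heap)
    assignment[domain] = user
    heapq.heappush(heap, (load + int(n_rows), user))

df["user_id"] = df["domain"].map(assignment)

# Rows handled by each user after the greedy pass.
loads = df["user_id"].value_counts()
print(loads)
```

Which user ends up with which domain can vary when several domains have the same count, but each domain still maps to exactly one user. This balances rows per user rather than domains per user, so it is only the right choice if row count is what should be even.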