concat strings from different columns and get unique values in pandas df


I have an issue when trying to concatenate multiple columns and build a set of their values.

This is an example df:

import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id':[1,2,3,4,5],
                   'email1':['[email protected]',np.nan,'[email protected]',np.nan, np.nan],
                   'email2':['[email protected]'   ,np.nan,'[email protected]','[email protected]', np.nan],
                   'email3':['[email protected]',np.nan,'[email protected]','[email protected]', '[email protected]']})  

df:

   customer id          email1          email2          email3
0            1  [email protected]  [email protected]  [email protected]
1            2             NaN             NaN             NaN
2            3  [email protected]   [email protected]   [email protected]
3            4             NaN   [email protected]   [email protected]
4            5             NaN             NaN   [email protected]

I would like to create a new column holding the unique values from all the email columns (email1, email2 & email3), so each customer gets a set of unique emails. Some emails differ only in case (upper, lower, etc.).

This is what I did so far:

df['ALL_EMAILS'] = df[['email1','email2','email3']].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)

This took about 3 minutes on a df of > 500K customers!

Then I created a function to process the output and return the unique values if the cell is not null:

def checkemail(x):
    if x:
        # to lower case
        lower_x = x.lower()
        y = lower_x.split(',')
        return set(y)

Then I applied it to the column:

df['ALL_EMAILS'] = df['ALL_EMAILS'].apply(checkemail)

but I got the wrong output in the ALL_EMAILS column!

   ALL_EMAILS
0  { [email protected], [email protected],  [email protected]}  
1                                               None  
2                   { [email protected], [email protected]}  
3                    { [email protected], [email protected]}  
4                                    {[email protected]}  

CodePudding user response:

Let's filter the email-like columns, then stack to convert them to a Series, transform to lowercase, and aggregate with set on level=0:

email = df.filter(like='email')
df['all_emails'] = email.stack().str.lower().groupby(level=0).agg(set)

   customer id          email1          email2          email3                        all_emails
0            1  [email protected]  [email protected]  [email protected]  {[email protected], [email protected]}
1            2             NaN             NaN             NaN                               NaN
2            3  [email protected]   [email protected]   [email protected]   {[email protected], [email protected]}
3            4             NaN   [email protected]   [email protected]                   {[email protected]}
4            5             NaN             NaN   [email protected]                   {[email protected]}
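A runnable sketch of this approach using made-up addresses (the real ones are redacted in the post above), to show the case-folding and how an all-NaN row ends up as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data -- the actual addresses are redacted in the post.
df = pd.DataFrame({
    'customer id': [1, 2, 3],
    'email1': ['A@x.com', np.nan, 'b@y.com'],
    'email2': ['a@X.com', np.nan, np.nan],
})

# stack() drops NaN and yields a (row, column) MultiIndex Series;
# grouping on level=0 collects each row's values back into one set.
email = df.filter(like='email')
df['all_emails'] = email.stack().str.lower().groupby(level=0).agg(set)

print(df['all_emails'].tolist())
# Row 0 collapses the two case variants into one entry;
# row 1 (all NaN) has no stacked values, so it stays NaN after alignment.
```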

CodePudding user response:

Try working on the values directly instead of joining them and then splitting again:

df['ALL_EMAILS'] = df.filter(like='email').apply(lambda x: set(x.dropna().str.lower()) or None, axis=1)

Output:

   customer id          email1          email2          email3                        ALL_EMAILS
0            1  [email protected]  [email protected]  [email protected]  {[email protected], [email protected]}
1            2             NaN             NaN             NaN                              None
2            3  [email protected]   [email protected]   [email protected]   {[email protected], [email protected]}
3            4             NaN   [email protected]   [email protected]                   {[email protected]}
4            5             NaN             NaN   [email protected]                   {[email protected]}
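For what it's worth, the original join/split round-trip can also be salvaged: the stray leading spaces in ALL_EMAILS come from joining with `', '` but splitting on `','` only, so stripping each token fixes it. A sketch with made-up addresses (the real ones are redacted):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data -- the actual addresses are redacted in the post.
df = pd.DataFrame({'email1': ['A@x.com', np.nan],
                   'email2': ['a@X.com', np.nan]})

# Same first step as the question: join the non-null values per row.
joined = df[['email1', 'email2']].apply(
    lambda x: ', '.join(x[x.notnull()]), axis=1)

def checkemail(x):
    if x:  # the empty string from an all-NaN row is falsy -> returns None
        # strip() removes the space that ', '.join inserted before each token
        return {e.strip().lower() for e in x.split(',')}

result = joined.apply(checkemail)
```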