I have an issue while trying to concatenate multiple columns and build a set of their values.
This is an example df:
import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id': [1, 2, 3, 4, 5],
                   'email1': ['[email protected]', np.nan, '[email protected]', np.nan, np.nan],
                   'email2': ['[email protected]', np.nan, '[email protected]', '[email protected]', np.nan],
                   'email3': ['[email protected]', np.nan, '[email protected]', '[email protected]', '[email protected]']})
df:
customer id email1 email2 email3
0 1 [email protected] [email protected] [email protected]
1 2 NaN NaN NaN
2 3 [email protected] [email protected] [email protected]
3 4 NaN [email protected] [email protected]
4 5 NaN NaN [email protected]
I would like to create a new column with the unique values from all three columns (email1, email2 & email3), so the new column holds a set of unique emails per customer. Some emails differ only in case (upper, lower, etc.).
This is what I did so far:
df['ALL_EMAILS'] = df[['email1', 'email2', 'email3']].apply(lambda x: ', '.join(x[x.notnull()]), axis=1)
This took about 3 minutes on a df of > 500K customers!
Then I created a function to handle the output and get the unique values if the cell is not null:
def checkemail(x):
    if x:
        # to lower
        lower_x = x.lower()
        y = lower_x.split(',')
        return set(y)
Then applied it to the column:
df['ALL_EMAILS'] = df['ALL_EMAILS'].apply(checkemail)
but I got the wrong output in the ALL_EMAILS column!
ALL_EMAILS
0 { [email protected], [email protected], [email protected]}
1 None
2 { [email protected], [email protected]}
3 { [email protected], [email protected]}
4 {[email protected]}
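Looking at the output, I suspect the leading spaces in the set items come from joining with ', ' but then splitting on ',' only. A minimal repro, using the placeholder address from above:

joined = ', '.join(['[email protected]', '[email protected]'])
parts = joined.split(',')
print(parts)       # ['[email protected]', ' [email protected]'] -- second item keeps the leading space
print(set(parts))  # two "different" entries even though the addresses are identical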
CodePudding user response:
Let's filter the email-like columns, then stack to convert them to a Series, transform to lowercase, and aggregate with set on level=0:
email = df.filter(like='email')
df['all_emails'] = email.stack().str.lower().groupby(level=0).agg(set)
customer id email1 email2 email3 all_emails
0 1 [email protected] [email protected] [email protected] {[email protected], [email protected]}
1 2 NaN NaN NaN NaN
2 3 [email protected] [email protected] [email protected] {[email protected], [email protected]}
3 4 NaN [email protected] [email protected] {[email protected]}
4 5 NaN NaN [email protected] {[email protected]}
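For reference, a small sketch of the intermediate objects, continuing from the df defined in the question:

email = df.filter(like='email')    # selects email1, email2, email3
stacked = email.stack()            # long Series; NaN entries are dropped by default
print(stacked.index.get_level_values(0))   # level 0 is the original row label
# groupby(level=0).agg(set) rebuilds one set per original row; a customer whose
# emails are all NaN has no entries in the stacked Series, so the assignment
# back to df leaves NaN for that row (row 1 above).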
CodePudding user response:
Try working on the values directly instead of joining them and then splitting again:
df['ALL_EMAILS'] = df.filter(like='email').apply(lambda x: set(x.dropna().str.lower()) or None, axis=1)
Output:
customer id email1 email2 email3 ALL_EMAILS
0 1 [email protected] [email protected] [email protected] {[email protected], [email protected]}
1 2 NaN NaN NaN None
2 3 [email protected] [email protected] [email protected] {[email protected], [email protected]}
3 4 NaN [email protected] [email protected] {[email protected]}
4 5 NaN NaN [email protected] {[email protected]}
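As a side note on the set(...) or None part: an empty set is falsy, so all-NaN rows end up as None instead of an empty set. A quick sketch, reusing the df from the question (row 1 is the all-NaN customer):

row = df.filter(like='email').iloc[1]     # customer id 2, every email is NaN
emails = set(row.dropna().str.lower())    # empty set when nothing is left after dropna
print(emails or None)                     # empty set is falsy, so this prints None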