I have a sample df:
id email1 email2 output
1 abc@gmail.com 123@gmail.com random_email1@gmail.com
1 xyz@gmail.com 234@gmail.com random_email1@gmail.com
1 NaN NaN random_email1@gmail.com
2 a123@gmail.com NaN random_email2@gmail.com
2 b123@gmail.com NaN random_email2@gmail.com
2 NaN lol@gmail.com random_email2@gmail.com
3 NaN NaN random_email3@gmail.com
4 NaN lolz@gmail.com random_email3@gmail.com
My main goal is to overwrite the output
column based on multiple conditions. If email1
has more than one unique email, overwrite all the output with the corresponding IDs with email1_broken
. Same goes for email2
but if both have more than one unique email, email2
takes priority and therefore output = email2_broken
. Lastly, we retain the email in the current output
column if both columns have an id with one unique email.
Attempt:
df['output'] = np.where(df.groupby('id')['email1'].nunique() > 1, 'email1_broken',df['output'])
df['output'] = np.where(df.groupby('id')['email2'].nunique() > 1, 'email2_broken',df['output'])
Desired df:
id email1 email2 output
1 abc@gmail.com 123@gmail.com email2_broken
1 xyz@gmail.com 234@gmail.com email2_broken
1 NaN NaN email2_broken
2 a123@gmail.com NaN email1_broken
2 b123@gmail.com NaN email1_broken
2 NaN lol@gmail.com email1_broken
3 NaN NaN random_email3@gmail.com
4 NaN lolz@gmail.com random_email3@gmail.com
Sample data:
import pandas as pd
import numpy as np
cols = ['id','email1','email2', 'output']
data = [
[1 , 'abc@gmail.com' , '123@gmail.com' , 'random_email1@gmail.com'],
[1 , 'xyz@gmail.com' , '234@gmail.com' , 'random_email1@gmail.com'],
[1 , np.nan , np.nan , 'random_email1@gmail.com'],
[2 , 'a123@gmail.com', np.nan , 'random_email2@gmail.com'],
[2 , 'b123@gmail.com', np.nan , 'random_email2@gmail.com'],
[2 , np.nan , 'lol@gmail.com' , 'random_email2@gmail.com'],
[3 , np.nan , np.nan , 'random_email3@gmail.com'],
[4 , np.nan , 'lolz@gmail.com' , 'random_email3@gmail.com']]
df = pd.DataFrame(data, columns=cols)
CodePudding user response:
You are very close:
df['output'] = np.where(df.groupby('id')['email1'].transform('nunique') > 1, 'email1_broken',df['output'])
df['output'] = np.where(df.groupby('id')['email2'].transform('nunique') > 1, 'email2_broken',df['output'])
Use transform to obtain the boolean array with the same shape.
CodePudding user response:
You can use np.select
(the equivalent of numpy.where
when multiple conditions are involved) with transform('nunique')
:
g = df.groupby('id')
df['output'] = np.select(
[g['email2'].transform('nunique').gt(1),
g['email1'].transform('nunique').gt(1)],
['email2_broken', 'email1_broken'],
df['output'])
print(df)
Output:
id email1 email2 output
0 1 abc@gmail.com 123@gmail.com email2_broken
1 1 xyz@gmail.com 234@gmail.com email2_broken
2 1 NaN NaN email2_broken
3 2 a123@gmail.com NaN email1_broken
4 2 b123@gmail.com NaN email1_broken
5 2 NaN lol@gmail.com email1_broken
6 3 NaN NaN random_email3@gmail.com
7 4 NaN lolz@gmail.com random_email3@gmail.com