I have a sample df:
id email1 email2 output
1 [email protected] [email protected] [email protected]
1 [email protected] [email protected] [email protected]
1 NaN NaN [email protected]
2 [email protected] NaN [email protected]
2 [email protected] NaN [email protected]
2 NaN [email protected] [email protected]
3 NaN NaN [email protected]
4 NaN [email protected] [email protected]
My main goal is to overwrite the output
column based on multiple conditions. If email1
has more than one unique email, overwrite all the output with the corresponding IDs with email1_broken
. Same goes for email2
but if both have more than one unique email, email2
takes priority and therefore output = email2_broken
. Lastly, we retain the email in the current output
column if both columns have an id with one unique email.
Attempt:
df['output'] = np.where(df.groupby('id')['email1'].nunique() > 1, 'email1_broken',df['output'])
df['output'] = np.where(df.groupby('id')['email2'].nunique() > 1, 'email2_broken',df['output'])
Desired df:
id email1 email2 output
1 [email protected] [email protected] email2_broken
1 [email protected] [email protected] email2_broken
1 NaN NaN email2_broken
2 [email protected] NaN email1_broken
2 [email protected] NaN email1_broken
2 NaN [email protected] email1_broken
3 NaN NaN [email protected]
4 NaN [email protected] [email protected]
Sample data:
import pandas as pd
import numpy as np
cols = ['id','email1','email2', 'output']
data = [
[1 , '[email protected]' , '[email protected]' , '[email protected]'],
[1 , '[email protected]' , '[email protected]' , '[email protected]'],
[1 , np.nan , np.nan , '[email protected]'],
[2 , '[email protected]', np.nan , '[email protected]'],
[2 , '[email protected]', np.nan , '[email protected]'],
[2 , np.nan , '[email protected]' , '[email protected]'],
[3 , np.nan , np.nan , '[email protected]'],
[4 , np.nan , '[email protected]' , '[email protected]']]
df = pd.DataFrame(data, columns=cols)
CodePudding user response:
You are very close:
df['output'] = np.where(df.groupby('id')['email1'].transform('nunique') > 1, 'email1_broken',df['output'])
df['output'] = np.where(df.groupby('id')['email2'].transform('nunique') > 1, 'email2_broken',df['output'])
Use transform to obtain the boolean array with the same shape.
CodePudding user response:
You can use np.select
(the equivalent of numpy.where
when multiple conditions are involved) with transform('nunique')
:
g = df.groupby('id')
df['output'] = np.select(
[g['email2'].transform('nunique').gt(1),
g['email1'].transform('nunique').gt(1)],
['email2_broken', 'email1_broken'],
df['output'])
print(df)
Output:
id email1 email2 output
0 1 [email protected] [email protected] email2_broken
1 1 [email protected] [email protected] email2_broken
2 1 NaN NaN email2_broken
3 2 [email protected] NaN email1_broken
4 2 [email protected] NaN email1_broken
5 2 NaN [email protected] email1_broken
6 3 NaN NaN [email protected]
7 4 NaN [email protected] [email protected]