Home > other >  Fill DataFrame using np.where with group by conditon
Fill DataFrame using np.where with group by conditon

Time:09-02

I have a sample df:

id           email1       email2            output
 1       [email protected]    [email protected]   [email protected]
 1       [email protected]    [email protected]   [email protected]
 1            NaN          NaN            [email protected]
 2       [email protected]    NaN            [email protected]
 2       [email protected]    NaN            [email protected]
 2           NaN          [email protected]   [email protected]
 3           NaN           NaN            [email protected]
 4           NaN          [email protected]   [email protected]

My main goal is to overwrite the output column based on multiple conditions. If email1 has more than one unique email, overwrite all the output with the corresponding IDs with email1_broken. Same goes for email2 but if both have more than one unique email, email2 takes priority and therefore output = email2_broken. Lastly, we retain the email in the current output column if both columns have an id with one unique email.

Attempt:

df['output'] = np.where(df.groupby('id')['email1'].nunique() > 1, 'email1_broken',df['output'])

df['output'] = np.where(df.groupby('id')['email2'].nunique() > 1, 'email2_broken',df['output'])

Desired df:

id           email1       email2            output
 1       [email protected]    [email protected]   email2_broken
 1       [email protected]    [email protected]   email2_broken
 1            NaN          NaN            email2_broken
 2       [email protected]    NaN            email1_broken
 2       [email protected]    NaN            email1_broken
 2           NaN          [email protected]   email1_broken
 3           NaN           NaN            [email protected]
 4           NaN          [email protected]  [email protected]

Sample data:

import pandas as pd
import numpy as np

cols = ['id','email1','email2', 'output']

data = [
[1   ,  '[email protected]' ,   '[email protected]'    , '[email protected]'],
[1   ,  '[email protected]'  ,  '[email protected]'    , '[email protected]'],
[1   ,      np.nan      ,      np.nan          , '[email protected]'],
[2   ,   '[email protected]',     np.nan          , '[email protected]'],
[2   ,   '[email protected]',     np.nan          , '[email protected]'],
[2   ,     np.nan        ,   '[email protected]'   , '[email protected]'],
[3   ,     np.nan        ,     np.nan          , '[email protected]'],
[4   ,     np.nan        ,   '[email protected]'  , '[email protected]']]

df = pd.DataFrame(data, columns=cols)

CodePudding user response:

You are very close:

df['output'] = np.where(df.groupby('id')['email1'].transform('nunique') > 1, 'email1_broken',df['output'])
df['output'] = np.where(df.groupby('id')['email2'].transform('nunique') > 1, 'email2_broken',df['output'])

Use transform to obtain the boolean array with the same shape.

CodePudding user response:

You can use np.select (the equivalent of numpy.where when multiple conditions are involved) with transform('nunique'):

g = df.groupby('id')

df['output'] = np.select(
 [g['email2'].transform('nunique').gt(1),
  g['email1'].transform('nunique').gt(1)],
 ['email2_broken', 'email1_broken'],
 df['output'])

print(df)

Output:

  id          email1          email2                   output
0   1   [email protected]   [email protected]            email2_broken
1   1   [email protected]   [email protected]            email2_broken
2   1             NaN             NaN            email2_broken
3   2  [email protected]             NaN            email1_broken
4   2  [email protected]             NaN            email1_broken
5   2             NaN   [email protected]            email1_broken
6   3             NaN             NaN  [email protected]
7   4             NaN  [email protected]  [email protected]
  • Related