Home > OS >  Apply np.where or np.select to multiple column pairs
Apply np.where or np.select to multiple column pairs

Time:10-14

Given a data df as follows:

import pandas as pd

data = [[1, 'A1', 'A1'], [2, 'A2', 'B2', 1, 1], [3, 'B3', 'B3', 3, 2], [4, None, None]]
df = pd.DataFrame(data, columns=['id', 'v1','v2','v3','v4'])
print(df)

Out:

   id    v1    v2   v3   v4
0   1    A1    A1  NaN  NaN
1   2    A2    B2  1.0  1.0
2   3    B3    B3  3.0  2.0
3   4  None  None  NaN  NaN

Let's say I need to check if multiple column pairs have identical content or same values:

col_pair = {'v1': 'v2', 'v3': 'v4'}

If I don't want to repeat np.where multiple times as follow, instead, I hope to apply col_pair or other possible solutions, how could I acheive that? Thanks.

df['v1_v2'] = np.where(df['v1'] == df['v2'], 1, 0)
df['v3_v4'] = np.where(df['v3'] == df['v4'], 1, 0)

The expected result:

   id    v1    v2   v3   v4 v1_v2 v3_v4
0   1    A1    A1  NaN  NaN     1   NaN
1   2    A2    B2  1.0  1.0     0     1
2   3    B3    B3  3.0  2.0     1     0
3   4  None  None  NaN  NaN   NaN   NaN

CodePudding user response:

You need test also if both values in pair key-value are missing in DataFrame.isna with DataFrame.all and passed to numpy.select:

for k, v in col_pair.items():
    df[f'{k}_{v}'] = np.select([df[[k, v]].isna().all(axis=1),
                                df[k] == df[v]], [None,1], default=0)

Out:

   id    v1    v2   v3   v4 v1_v2 v3_v4
0   1    A1    A1  NaN  NaN     1  None
1   2    A2    B2  1.0  1.0     0     1
2   3    B3    B3  3.0  2.0     1     0
3   4  None  None  NaN  NaN  None  None
  • Related