Given a data df
as follows:
import pandas as pd
data = [[1, 'A1', 'A1'], [2, 'A2', 'B2', 1, 1], [3, 'B3', 'B3', 3, 2], [4, None, None]]
df = pd.DataFrame(data, columns=['id', 'v1','v2','v3','v4'])
print(df)
Out:
id v1 v2 v3 v4
0 1 A1 A1 NaN NaN
1 2 A2 B2 1.0 1.0
2 3 B3 B3 3.0 2.0
3 4 None None NaN NaN
Let's say I need to check if multiple column pairs have identical content or same values:
col_pair = {'v1': 'v2', 'v3': 'v4'}
If I don't want to repeat np.where
multiple times as follow, instead, I hope to apply col_pair
or other possible solutions, how could I acheive that? Thanks.
df['v1_v2'] = np.where(df['v1'] == df['v2'], 1, 0)
df['v3_v4'] = np.where(df['v3'] == df['v4'], 1, 0)
The expected result:
id v1 v2 v3 v4 v1_v2 v3_v4
0 1 A1 A1 NaN NaN 1 NaN
1 2 A2 B2 1.0 1.0 0 1
2 3 B3 B3 3.0 2.0 1 0
3 4 None None NaN NaN NaN NaN
CodePudding user response:
You need test also if both values in pair key-value are missing in DataFrame.isna
with DataFrame.all
and passed to numpy.select
:
for k, v in col_pair.items():
df[f'{k}_{v}'] = np.select([df[[k, v]].isna().all(axis=1),
df[k] == df[v]], [None,1], default=0)
Out:
id v1 v2 v3 v4 v1_v2 v3_v4
0 1 A1 A1 NaN NaN 1 None
1 2 A2 B2 1.0 1.0 0 1
2 3 B3 B3 3.0 2.0 1 0
3 4 None None NaN NaN None None