Home > database >  Compare three columns of pandas df using np.where
Compare three columns of pandas df using np.where

Time:02-26

I have DataFrame as given:

TXN_DATE_TIME   TX_ID   CUST_ID STATE_1         STATE_2       STATE_3
01-06-2020 00:00    1   123      Maharashtra    Maharashtra   Maharashtra
01-06-2020 00:00    2   345      Pune           Chennai       Gujarat
01-06-2020 00:00    3   222      Chennai        Gujarat       Chennai
01-06-2020 00:00    4   1356     Gujarat        Chennai       Delhi
01-06-2020 00:00    5   2345     Punjab         Punjab        Delhi
01-06-2020 00:00    6   1111     Haryana        Delhi         Punjab
01-06-2020 00:00    7   5678     Delhi          Maharashtra   Haryana
01-06-2020 00:00    8   9999     Kerela         Assam         Assam
01-06-2020 00:00    9   2345     Assam          Assam         Assam
01-06-2020 00:00    10  6666     Tripura        Tripura       Tripura
01-06-2020 00:00    11  7896     Kolkatta       Kolkatta      Kolkatta

I want to create a new column match in the df containing two values Match and No match based on the following conditions:

If State_1==State_2==STATE_3 Then Match=1
Else Match=0

Hence the expected df will be:

enter image description here

I tried to use np.where in pandas df by using:

df['MATCH']=np.where(df['STATE_1']==df['STATE_2']==df['STATE_3'],1,0)

But it gives me the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I wanted to know that is there any other faster method other than np.where that can be used to achieve the expected result and if not then how can I avoid this error? ?

CodePudding user response:

Use:

df['MATCH']=(df['STATE_1']==df['STATE_2'])&(df['STATE_2']==df['STATE_3'])
df['MATCH'] = df['MATCH'].astype(int)

CodePudding user response:

We could also use eq all:

df['MATCH'] = df[['STATE_2', 'STATE_3']].eq(df['STATE_1'], axis=0).all(axis=1).astype(int)

If you have more than 3 "STATE"s, we could use filter:

df['MATCH'] = df.filter(like='STATE').eq(df['STATE_1'], axis=0).all(axis=1)

Output:

       TXN_DATE_TIME  TX_ID  CUST_ID      STATE_1      STATE_2      STATE_3   MATCH
0   01-06-2020 00:00      1      123  Maharashtra  Maharashtra  Maharashtra       1
1   01-06-2020 00:00      2      345         Pune      Chennai      Gujarat       0
2   01-06-2020 00:00      3      222      Chennai      Gujarat      Chennai       0
3   01-06-2020 00:00      4     1356      Gujarat      Chennai        Delhi       0
4   01-06-2020 00:00      5     2345       Punjab       Punjab        Delhi       0
5   01-06-2020 00:00      6     1111      Haryana        Delhi       Punjab       0
6   01-06-2020 00:00      7     5678        Delhi  Maharashtra      Haryana       0
7   01-06-2020 00:00      8     9999       Kerela        Assam        Assam       0
8   01-06-2020 00:00      9     2345        Assam        Assam        Assam       1
9   01-06-2020 00:00     10     6666      Tripura      Tripura      Tripura       1
10  01-06-2020 00:00     11     7896     Kolkatta     Kolkatta     Kolkatta       1
  • Related