Flag subset of a dataframe based on another dataframe values-CodePudding

I've encountered a problem which I didn't succeed to solve, for now. Your assistance on this one would be highly appreciated.

I have 2 dataframes:

first_df:
   A  B  C  D
1  1  a  q  zz
2  2  b  w  xx
3  3  c  e  yy
4  4  d  r  vv

second_df:
   C1 C2
1  10  a 
2  20  b 
3  70  g 
4  80  h

What I want to achieve is mark rows from first_df, based on values from second_df. Marking those rows should be based on comparing column values:

(first_df['A'] == second_df['C1'] * 10) & (first_df['B'] == second_df['C2'])

Expected output should be like this:

compared_df:
   A  B  C  D   Match
1  1  a  q  zz  True
2  2  b  w  xx  True
3  3  c  e  yy  False
4  4  d  r  vv  False

Could you please point me in a right direction? Which pandas methods should I use to achieve my goal? I've tried many things, but I'm pandas beginner, so it's hard to assess if those attempts were correct.

CodePudding user response：

First create Multiindex on both the dataframes then use MultiIndex.isin to test for the occurrence of the index values of first dataframe in the index of second dataframe in order the create boolean flag:

i1 = first_df.set_index([first_df['A'] * 10, 'B']).index
i2 = second_df.set_index(['C1', 'C2']).index

first_df['Match'] = i1.isin(i2)

Result

print(first_df)

   A  B  C   D  Match
1  1  a  q  zz   True
2  2  b  w  xx   True
3  3  c  e  yy  False
4  4  d  r  vv  False

CodePudding user response：

The code you wrote is almost completely correct; I think you just multiplied the wrong column by 10:

first_df['Match'] = (first_df['A'] * 10 == second_df['C1']) & (first_df['B'] == second_df['C2'])

Output:

>>> first_df
   A  B  C   D  Match
1  1  a  q  zz   True
2  2  b  w  xx   True
3  3  c  e  yy  False
4  4  d  r  vv  False

CodePudding user response：

You can use Series.isin:

In [2516]: first_df['Match'] = first_df['A'].mul(10).isin(second_df['C1'])

In [2517]: df1
Out[2517]: 
   A  B  C   D  Match
1  1  a  q  zz   True
2  2  b  w  xx   True
3  3  c  e  yy  False
4  4  d  r  vv  False

CodePudding user response：

If you don't mind writing it as sql code and you can import libraries to your project, you might use pandasql library.

import pandasql as psql

compared_df =  psql.sqldf("""
select f.A, f.B, f.C, f.D, 
CASE WHEN s.C1 is not null then 1 else 0 end as Match
FROM first_df f 
LEFT JOIN second_df s ON (f.A *10 = s.C1 AND f.B = s.C2)
""")

CodePudding user response：

I think you could use merge here where you use the indicator parameter to see if a match is made or not.

out = df1.assign(A=df1['A']*10).merge(df2, left_on=['A','B'], right_on=['C1','C2'], 
                                      how='left', indicator=True)
out['Match'] = out['_merge'].replace({'both':True, 'left_only':False})
out = out.drop(columns=['C1','C2','_merge'])

Output:

    A  B  C   D  Match
0  10  a  q  zz   True
1  20  b  w  xx   True
2  30  c  e  yy  False
3  40  d  r  vv  False