Home > Software engineering >  Check if a row in one data frame exist in another data frame but do not merge both data frames
Check if a row in one data frame exist in another data frame but do not merge both data frames

Time:03-21

I have a data frame (df_validation) in which I have done some validation. If all was ok, this should be empty. If not, rows with info like this will be found

CHR  START  END
1    1000   2000 # Imagine this failed

Now, I want to communicate to another data frame (df) that validation was ok except for the ones that appear in df_validation

This is df

CHR  START  END
1    1000   2000
2    1000   2000
3    1000   2000
4    1000   2000
5    1000   2000

and this is the result I would like to have

CHR  START  END    Pass_validation?
1    1000   2000      False
2    1000   2000      True
3    1000   2000      True
4    1000   2000      True
5    1000   2000      True

How can I do this? The first idea was to merge both but having same column name, it adds a '_X' on left and '_Y' on right. The new right columns I don't really need, just a new column as in the example.

CodePudding user response:

Idea is use indicator=True parameter for helper column _merge and for False for match compare for not equal both. If is omit on parameter is joined by intersection of columnsname in both dataFrames, here CHR, START and END.

df2['Pass_validation?'] = df2.merge(df_validation, 
                                    indicator=True, 
                                    how='left')['_merge'].ne('both')
print (df2)
    CHR  START  END  Pass_validation?
0  1    1000   2000             False
1  2    1000   2000              True
2  3    1000   2000              True
3  4    1000   2000              True
4  5    1000   2000              True

Details:

print (df2.merge(df_validation, indicator=True, how='left'))
    CHR  START  END     _merge
0  1    1000   2000       both
1  2    1000   2000  left_only
2  3    1000   2000  left_only
3  4    1000   2000  left_only
4  5    1000   2000  left_only
  • Related