I have two dataframe below:
df1>>
df2>>
I need to compare the two dataframe and find the id's in col1 where col2 value is a mismatch i.e. for the sample dataset 3 should be the answer.
Please note in the actual problem I have 10^6 number of unique id's in col 1 hence if-else ladder or running loop at each unique col1 key level is taking a very long execution time. I am looking for an efficient method to do this. I really appreciate any help you can provide.
CodePudding user response:
df1 = pd.DataFrame({"col1": [1, 1,2,2,3,3] ,"col2": ['A','B','C','D','E','F']})
df2 = pd.DataFrame({"col1": [1, 1,2,2,3,3] ,"col2": ['A','B','D','C','E','K']})
df3 = df1.merge(df2, indicator=True, how='outer').query('_merge != "both"').drop('_merge', 1)
df3
CodePudding user response:
You can compare dataframes with pd.DataFrame.compare
.
>>> df1 = pd.DataFrame({"col1": [1, 1, 2, 2, 3, 3], "col2": list("ABCDEF")})
>>> df2 = pd.DataFrame({"col1": [1, 1, 2, 2, 3, 3], "col2": list("ABDCEK")})
>>> df1.compare(df2)
col2
self other
2 C D
3 D C
5 F K