Compare the values of 2 dataframe and create a new dataframe based on the values match and mismatch-CodePudding

I have a dataframe df1.

df1 = pd.DataFrame([["A","X",5,4,1],["A","Y",3,1,3],["B","X",4,7,4],["B","W",3,9,3],["C","Z",7,4,5],["C","Y",1,0,6],["D","P",8,4,7],["D","Q",7,2,2]], columns=['col1', 'col2', 'col3', 'col4','col5'])

  col1 col2  col3  col4  col5
0    A    X     5     4     1
1    A    Y     3     1     3
2    B    X     4     7     4
3    B    W     3     9     3
4    C    Z     7     4     5
5    C    Y     1     0     6
6    D    P     8     4     7
7    D    Q     7     2     2

I have another dataframe df2.

df2 = pd.DataFrame([["B","W",3,7,3],["B","X",4,7,5],["C","Z",8,4,6],["C","Y",1,0,6]], columns=['col1', 'col2', 'col3', 'col4','col5'])

All the rows present in df1 are not present in df2 and their row order is different.

  col1 col2  col3  col4  col5
0    B    W     3     7     3
1    B    X     4     7     5
2    C    Z     8     4     6
3    C    Y     1     0     6

I want to compare the values of particular rows of 2 dataframe. If the value present in both the dataframe is the same, make it True otherwise False.

Expected Output:

Out = pd.DataFrame([["B","W",True,False,True],["B","X",True,True,False],["C","Z",False,True,False],["C","Y",True,True,True]], columns=['col1', 'col2', 'col3', 'col4','col5'])

  col1 col2   col3   col4   col5
0    B    W   True  False   True
1    B    X   True   True  False
2    C    Y   True   True   True
3    C    Z  False   True  False

How to do it?

CodePudding user response：

IIUC, you could do:

# set reference columns
cols = ['col1', 'col2']

# set references as index to align the data and compare
Out = df1.set_index(cols).eq(df2.set_index(cols))

# keep only rows where there is at least one True
# and restore the references as columns
Out = Out[Out.any(axis=1)].reset_index()

output:

  col1 col2   col3   col4   col5
0    B    W   True  False   True
1    B    X   True   True  False
2    C    Y   True   True   True
3    C    Z  False   True  False

CodePudding user response：

It's also possible to do this:

cols = ['col1', 'col2']

# concat both dataframes and creating a new unique index
c_df = pd.concat([df1, df2], ignore_index=True)

# Described after this snippet
Out = c_df[cols].join(~c_df.groupby(cols).diff().dropna().astype(bool), how='inner')

You're grouping rows by your reference columns, performing a diff between them.
In single row groups diff returns NaN and you don't want that, that's why you're dropping it.
The remaining values are numeric. If diff is zero, that means both columns are equal for that same row.
If you cast diff as booleans, you get False for equals values, and that's why you need to perform a negation (~)
There you go! You should just inner-join the new columns into a view of your reference columns from your concatenated data to match your output and that's it