Home > database >  Compare the values of 2 dataframe and create a new dataframe based on the values match and mismatch
Compare the values of 2 dataframe and create a new dataframe based on the values match and mismatch

Time:02-17

I have a dataframe df1.

df1 = pd.DataFrame([["A","X",5,4,1],["A","Y",3,1,3],["B","X",4,7,4],["B","W",3,9,3],["C","Z",7,4,5],["C","Y",1,0,6],["D","P",8,4,7],["D","Q",7,2,2]], columns=['col1', 'col2', 'col3', 'col4','col5'])
  col1 col2  col3  col4  col5
0    A    X     5     4     1
1    A    Y     3     1     3
2    B    X     4     7     4
3    B    W     3     9     3
4    C    Z     7     4     5
5    C    Y     1     0     6
6    D    P     8     4     7
7    D    Q     7     2     2

I have another dataframe df2.

df2 = pd.DataFrame([["B","W",3,7,3],["B","X",4,7,5],["C","Z",8,4,6],["C","Y",1,0,6]], columns=['col1', 'col2', 'col3', 'col4','col5'])

All the rows present in df1 are not present in df2 and their row order is different.

  col1 col2  col3  col4  col5
0    B    W     3     7     3
1    B    X     4     7     5
2    C    Z     8     4     6
3    C    Y     1     0     6

I want to compare the values of particular rows of 2 dataframe. If the value present in both the dataframe is the same, make it True otherwise False.

Expected Output:

Out = pd.DataFrame([["B","W",True,False,True],["B","X",True,True,False],["C","Z",False,True,False],["C","Y",True,True,True]], columns=['col1', 'col2', 'col3', 'col4','col5'])
  col1 col2   col3   col4   col5
0    B    W   True  False   True
1    B    X   True   True  False
2    C    Y   True   True   True
3    C    Z  False   True  False

How to do it?

CodePudding user response:

IIUC, you could do:

# set reference columns
cols = ['col1', 'col2']

# set references as index to align the data and compare
Out = df1.set_index(cols).eq(df2.set_index(cols))

# keep only rows where there is at least one True
# and restore the references as columns
Out = Out[Out.any(axis=1)].reset_index()

output:

  col1 col2   col3   col4   col5
0    B    W   True  False   True
1    B    X   True   True  False
2    C    Y   True   True   True
3    C    Z  False   True  False

CodePudding user response:

It's also possible to do this:

cols = ['col1', 'col2']

# concat both dataframes and creating a new unique index
c_df = pd.concat([df1, df2], ignore_index=True)

# Described after this snippet
Out = c_df[cols].join(~c_df.groupby(cols).diff().dropna().astype(bool), how='inner')
  • You're grouping rows by your reference columns, performing a diff between them.
  • In single row groups diff returns NaN and you don't want that, that's why you're dropping it.
  • The remaining values are numeric. If diff is zero, that means both columns are equal for that same row.
  • If you cast diff as booleans, you get False for equals values, and that's why you need to perform a negation (~)
  • There you go! You should just inner-join the new columns into a view of your reference columns from your concatenated data to match your output and that's it
  • Related