I have a dataframe df1.
df1 = pd.DataFrame([["A","X",5,4,1],["A","Y",3,1,3],["B","X",4,7,4],["B","W",3,9,3],["C","Z",7,4,5],["C","Y",1,0,6],["D","P",8,4,7],["D","Q",7,2,2]], columns=['col1', 'col2', 'col3', 'col4','col5'])
col1 col2 col3 col4 col5
0 A X 5 4 1
1 A Y 3 1 3
2 B X 4 7 4
3 B W 3 9 3
4 C Z 7 4 5
5 C Y 1 0 6
6 D P 8 4 7
7 D Q 7 2 2
I have another dataframe df2.
df2 = pd.DataFrame([["B","W",3,7,3],["B","X",4,7,5],["C","Z",8,4,6],["C","Y",1,0,6]], columns=['col1', 'col2', 'col3', 'col4','col5'])
All the rows present in df1 are not present in df2 and their row order is different.
col1 col2 col3 col4 col5
0 B W 3 7 3
1 B X 4 7 5
2 C Z 8 4 6
3 C Y 1 0 6
I want to compare the values of particular rows of 2 dataframe. If the value present in both the dataframe is the same, make it True otherwise False.
Expected Output:
Out = pd.DataFrame([["B","W",True,False,True],["B","X",True,True,False],["C","Z",False,True,False],["C","Y",True,True,True]], columns=['col1', 'col2', 'col3', 'col4','col5'])
col1 col2 col3 col4 col5
0 B W True False True
1 B X True True False
2 C Y True True True
3 C Z False True False
How to do it?
CodePudding user response:
IIUC, you could do:
# set reference columns
cols = ['col1', 'col2']
# set references as index to align the data and compare
Out = df1.set_index(cols).eq(df2.set_index(cols))
# keep only rows where there is at least one True
# and restore the references as columns
Out = Out[Out.any(axis=1)].reset_index()
output:
col1 col2 col3 col4 col5
0 B W True False True
1 B X True True False
2 C Y True True True
3 C Z False True False
CodePudding user response:
It's also possible to do this:
cols = ['col1', 'col2']
# concat both dataframes and creating a new unique index
c_df = pd.concat([df1, df2], ignore_index=True)
# Described after this snippet
Out = c_df[cols].join(~c_df.groupby(cols).diff().dropna().astype(bool), how='inner')
- You're grouping rows by your reference columns, performing a diff between them.
- In single row groups
diff
returnsNaN
and you don't want that, that's why you're dropping it. - The remaining values are numeric. If diff is zero, that means both columns are equal for that same row.
- If you cast diff as booleans, you get
False
for equals values, and that's why you need to perform a negation (~
) - There you go! You should just inner-join the new columns into a view of your reference columns from your concatenated data to match your output and that's it