Home > Mobile >  column comparison of two dataframe, return df with mismatches python
column comparison of two dataframe, return df with mismatches python

Time:10-04

I want to print two dataframes that print the rows where there is a mismatch in a given column, here the "second_column": "first_column" is a key value that identify same product in both dataframes

    import pandas as pd
    
    data1 = {
        'first_column': ['1', '2', '2'],
        'second_column': ['1', '2', '2'],
        'third_column': ['1', '2', '2'],
        'fourth_column': ['1', '2', '2']
    }
    
    df1 = pd.DataFrame(data1)
    
    print(df1)
    
    test = df1['second_column'].nunique()
    
    data2 = {
        'first_column': ['1', '2', '2'],
        'second_column': ['3', '4', '2'],
        'third_column': ['1', '2', '2'],
        'fourth_column': ['1', '2', '2']
    }
    
    df2 = pd.DataFrame(data2)
    print(df2)

enter image description here

expected output:

enter image description here

CodePudding user response:

the compare method can do what you want.

different_rows = df1.compare(df2, align_axis=1).index
df1.loc[different_rows]

With this method, one important point is if there are extra rows (index) then it will not return a difference.

or if you want to find differences in one column only, you can first join on the index then check if the join matches

joined_df = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined_df['second_column']!=joined_df['second_column_df2']
print(joined_df.loc[diff, df1.columns])

CodePudding user response:

IIUC btw, you screenshots don't match your DF definition

df1.loc[~df1['second_column'].isin(df2['second_column'])]

first_column    second_column   third_column    fourth_column
0   1   1   1   1
df2.loc[~df2['second_column'].isin(df1['second_column'])]
first_column    second_column   third_column    fourth_column
0   1   3   1   1
1   2   4   2   2
  • Related