Non-matching rows between two DataFrames-CodePudding

I am new to Python and would like to seek you help on this please. I would like to find out the non-matching rows between 2 dataframes i.e. df1 and df2 with thousands of rows. They both contain the same number of columns with the same name.

df2 has 10 entries lesser than df1 which I am trying to find out what they are. I have tried pd.concat([df1,df2]).drop_duplicates(keep=False) but it returns zero result.

What could be the reason? Any help/advice would be much appreciated. Thanks a lot.

CodePudding user response：

The following code will remove rows in df1 that are present in df2

df1[~df1.isin(df2)]

CodePudding user response：

concat combines two frames. You're trying to find the difference between two frames. This can be done with compare. As the doc example shows, given these two frames:

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)

df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0

You can find the different rows with compare :

df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

In this case compare returns one row for every row that has a difference and only the columns that are actually different.

compare can return the equal values or all original values as well:

df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0

CodePudding user response：

Just because you have 10 less entries doesn't mean you're going to find duplicates. You probably already have duplicates inside the first dataframe.

Demo:

# Length: 7
df1 = pd.DataFrame({'col1': list('AAABCDE'),
                    'col2': list('FFFGHIJ'),
                    'col3': list('1112345')})

# Length: 5
df2 = pd.DataFrame({'col1': list('ABCDE'),
                    'col2': list('FGHIJ'),
                    'col3': list('12345')})

Your code:

>>> pd.concat([df1,df2]).drop_duplicates(keep=False)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []

Try:

>>> len(df1.drop_duplicates())
5

>>> len(df2.drop_duplicates())
5

CodePudding user response：

Assuming both df1 and df2 are Pandas Dataframe, the following code returns True for matching rows and false for the other:

print((df1 == df2).any(1))

If needed to check each and every columns in all the rows, try this:

print((df1 == df2).stack())