Home > Mobile >  Non-matching rows between two DataFrames
Non-matching rows between two DataFrames

Time:03-04

I am new to Python and would like to seek you help on this please. I would like to find out the non-matching rows between 2 dataframes i.e. df1 and df2 with thousands of rows. They both contain the same number of columns with the same name.

df2 has 10 entries lesser than df1 which I am trying to find out what they are. I have tried pd.concat([df1,df2]).drop_duplicates(keep=False) but it returns zero result.

What could be the reason? Any help/advice would be much appreciated. Thanks a lot.

CodePudding user response:

The following code will remove rows in df1 that are present in df2

df1[~df1.isin(df2)]

CodePudding user response:

concat combines two frames. You're trying to find the difference between two frames. This can be done with compare. As the doc example shows, given these two frames:

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)

df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0

You can find the different rows with compare :

df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

In this case compare returns one row for every row that has a difference and only the columns that are actually different.

compare can return the equal values or all original values as well:

df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

or

df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0

CodePudding user response:

Just because you have 10 less entries doesn't mean you're going to find duplicates. You probably already have duplicates inside the first dataframe.

Demo:

# Length: 7
df1 = pd.DataFrame({'col1': list('AAABCDE'),
                    'col2': list('FFFGHIJ'),
                    'col3': list('1112345')})

# Length: 5
df2 = pd.DataFrame({'col1': list('ABCDE'),
                    'col2': list('FGHIJ'),
                    'col3': list('12345')})

Your code:

>>> pd.concat([df1,df2]).drop_duplicates(keep=False)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []

Try:

>>> len(df1.drop_duplicates())
5

>>> len(df2.drop_duplicates())
5

CodePudding user response:

Assuming both df1 and df2 are Pandas Dataframe, the following code returns True for matching rows and false for the other:

print((df1 == df2).any(1))

If needed to check each and every columns in all the rows, try this:

print((df1 == df2).stack())
  • Related