I am new to Python and would like to seek you help on this please. I would like to find out the non-matching rows between 2 dataframes i.e. df1 and df2 with thousands of rows. They both contain the same number of columns with the same name.
df2 has 10 entries lesser than df1 which I am trying to find out what they are.
I have tried pd.concat([df1,df2]).drop_duplicates(keep=False)
but it returns zero result.
What could be the reason? Any help/advice would be much appreciated. Thanks a lot.
CodePudding user response:
The following code will remove rows in df1
that are present in df2
df1[~df1.isin(df2)]
CodePudding user response:
concat
combines two frames. You're trying to find the difference between two frames. This can be done with compare. As the doc example shows, given these two frames:
df = pd.DataFrame(
{
"col1": ["a", "a", "b", "b", "a"],
"col2": [1.0, 2.0, 3.0, np.nan, 5.0],
"col3": [1.0, 2.0, 3.0, 4.0, 5.0]
},
columns=["col1", "col2", "col3"],
)
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
You can find the different rows with compare
:
df.compare(df2)
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
In this case compare
returns one row for every row that has a difference and only the columns that are actually different.
compare
can return the equal values or all original values as well:
df.compare(df2, keep_equal=True)
col1 col3
self other self other
0 a c 1.0 1.0
2 b b 3.0 4.0
or
df.compare(df2, keep_shape=True, keep_equal=True)
col1 col2 col3
self other self other self other
0 a c 1.0 1.0 1.0 1.0
1 a a 2.0 2.0 2.0 2.0
2 b b 3.0 3.0 3.0 4.0
3 b b NaN NaN 4.0 4.0
4 a a 5.0 5.0 5.0 5.0
CodePudding user response:
Just because you have 10 less entries doesn't mean you're going to find duplicates. You probably already have duplicates inside the first dataframe.
Demo:
# Length: 7
df1 = pd.DataFrame({'col1': list('AAABCDE'),
'col2': list('FFFGHIJ'),
'col3': list('1112345')})
# Length: 5
df2 = pd.DataFrame({'col1': list('ABCDE'),
'col2': list('FGHIJ'),
'col3': list('12345')})
Your code:
>>> pd.concat([df1,df2]).drop_duplicates(keep=False)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Try:
>>> len(df1.drop_duplicates())
5
>>> len(df2.drop_duplicates())
5
CodePudding user response:
Assuming both df1 and df2 are Pandas Dataframe, the following code returns True for matching rows and false for the other:
print((df1 == df2).any(1))
If needed to check each and every columns in all the rows, try this:
print((df1 == df2).stack())