df.duplicated() not finding duplicates

Time:01-06

I am trying to run this code.

import pandas as pd

df = pd.DataFrame({'A':['1','2'],
                   'B':['1','2'],
                   'C':['1','2']})
print(df.duplicated())

It gives me this output:

0    False
1    False
dtype: bool

Why does it show index 1 as False rather than True?

I expected this output:

0    False
1    True
dtype: bool

I'm using Python 3.11.1 and pandas 1.4.4.

CodePudding user response:

duplicated compares full rows (or only a subset of the columns if the subset parameter is used). A row is flagged True only when it repeats an earlier row.

Here you don't have any duplicate:

   A  B  C
0  1  1  1   # this row is unique
1  2  2  2   # this one is also unique
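
As a minimal sketch of the subset parameter mentioned above, consider a small made-up frame (not the one from the question) whose rows differ as a whole but share a value in column A. The column values and variable names here are purely illustrative:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 agree in 'A' but differ in 'B'.
df_demo = pd.DataFrame({'A': ['1', '1', '2'],
                        'B': ['x', 'y', 'z']})

# Comparing whole rows: no row repeats an earlier one.
full_rows = df_demo.duplicated()

# Comparing only column 'A': row 1 repeats the 'A' value of row 0.
only_a = df_demo.duplicated(subset=['A'])

print(full_rows.tolist())  # [False, False, False]
print(only_a.tolist())     # [False, True, False]
```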

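Perhaps you want to check for duplicates column-wise instead? Transposing the frame turns the columns into rows, so duplicated then compares columns: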

df.T.duplicated()

Output:

A    False
B     True
C     True
dtype: bool
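
If the goal is to drop the repeated columns, one possible approach (a sketch, not necessarily what the asker needs) is to use the transposed result as a boolean column mask. The names dup_cols and deduped are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2'],
                   'B': ['1', '2'],
                   'C': ['1', '2']})

# True for every column that repeats an earlier column (here B and C).
dup_cols = df.T.duplicated()

# Keep only the columns that are not flagged as duplicates.
deduped = df.loc[:, ~dup_cols]

print(list(deduped.columns))  # ['A']
```

Note that transposing copies the data, so this may be slow on very wide or very large frames.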

CodePudding user response:

You are not getting the expected output because your dataframe has no duplicate rows to begin with. Below, I appended a copy of the rows to the end of your dataframe, which is probably closer to what you are looking for:

import pandas as pd

df = pd.DataFrame({'A':['1','2'],
                   'B':['1','2'],
                   'C':['1','2']})
                   
df = pd.concat([df]*2)
df


    A   B   C
0   1   1   1
1   2   2   2
0   1   1   1
1   2   2   2

df.duplicated(keep='first')

Output:

0    False
1    False
0     True
1     True
dtype: bool

If you want it the other way around, so that the last occurrence is treated as the original and the earlier ones are marked as duplicates:

df.duplicated(keep='last')

0     True
1     True
0    False
1    False
dtype: bool
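
As a further sketch building on the same concatenated frame, keep=False flags every row that has a copy anywhere in the frame, and drop_duplicates removes the repeats. The variable names all_dups and unique_rows are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2'],
                   'B': ['1', '2'],
                   'C': ['1', '2']})
df = pd.concat([df] * 2)

# keep=False marks all occurrences of a repeated row, not just the later ones.
all_dups = df.duplicated(keep=False)

# drop_duplicates keeps the first occurrence of each row by default.
unique_rows = df.drop_duplicates()

print(all_dups.tolist())  # [True, True, True, True]
print(len(unique_rows))   # 2
```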