Strange results for Dataframe.duplicated() | Pandas

Time:10-07

I was trying to find duplicates in a DataFrame column of dtype string[python]. When I run x['comment'].duplicated(), I get the following output:

1         False
2         False
3         False
4         False
          ...  
155071     True
155072     True
155073     True
155074     True
155075     True

I then decided to look at the contents of those strings using x['comment'].iloc[155071:155075]. This gives me the following output:

155071    Hola. @strange67 en mí debut, coincidió que se...
155072    Yo estuve altísima un tiempo ( la glico de 16 ...
155073    Hola strange67,yo cuando debute hace 32 anos s...
155074    No soy medico ni nada pero partiendo de eso, h...
Name: comment, dtype: string

As you can see, the strings are not duplicates of each other at all. So I reran duplicated() on just these few rows: x['comment'].iloc[155071:155075].duplicated(). It now says that the contents are not duplicates:

155071    False
155072    False
155073    False
155074    False
Name: comment, dtype: bool

Am I doing something wrong, or misunderstanding how duplicated() works? How do I find duplicate records? I ask because drop_duplicates() removes a major chunk of what looks like non-duplicate data.

Answer:

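From the docs, pandas.Series.duplicated looks for duplicates anywhere in the column. You used the default keep="first" parameter, so the first occurrence of each value is not flagged, but every later occurrence is. Each of these strings is duplicated somewhere above. When you call duplicated() on just the slice, those four rows are only compared with each other, so none of them is flagged.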

The same thing happens in this simplified example:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3, 1, 2, 3]})
>>> df['a'].duplicated()
0    False
1    False
2    False
3     True
4     True
5     True
6     True
7     True
8     True
Name: a, dtype: bool
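
To make the difference between the full column and a slice concrete, here is a minimal sketch. The Series `s` and its values are made up for illustration; only `duplicated()` and `iloc` come from the question.

```python
import pandas as pd

# Hypothetical data: each value appears twice, once early and once late.
s = pd.Series(["a", "b", "c", "a", "b", "c"], dtype="string")

# On the full column, the later copies are flagged because an
# earlier occurrence exists somewhere above them.
full = s.duplicated()

# On the slice, the same rows are only compared with each other,
# so nothing is flagged.
tail = s.iloc[3:6].duplicated()

print(full.tolist())  # [False, False, False, True, True, True]
print(tail.tolist())  # [False, False, False]
```

This mirrors what happened in the question: the last rows are True on the whole column and False when checked in isolation.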

You could do x['comment'] == 'Hola. @strange67 en mí debut, coincidió que se...' (with the full string, not the truncated display) to find all of the matches for a single value. Or use x['comment'].duplicated(keep=False) to mark every occurrence, including the first.
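
As a rough sketch of how you might inspect the duplicates before dropping anything, the example below uses a small made-up DataFrame in place of your `x`. The comment values are hypothetical; the calls (`duplicated`, `groupby`, `drop_duplicates`) are standard pandas.

```python
import pandas as pd

# Hypothetical stand-in for the real data.
x = pd.DataFrame({"comment": ["hola", "adios", "hola", "gracias", "adios"]})

# keep=False flags every member of a duplicated group, first one included.
mask = x["comment"].duplicated(keep=False)

# Map each repeated value to the index labels where it occurs,
# so you can see where the earlier copies live.
groups = x[mask].groupby("comment").groups
for value, labels in groups.items():
    print(value, list(labels))

# drop_duplicates keeps the first copy of each value and removes the rest.
deduped = x.drop_duplicates(subset="comment")
print(deduped)
```

If the groups show that the flagged rows really do have earlier copies, then drop_duplicates() is removing genuine repeats rather than unique data.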
