I was trying to find duplicates in a DataFrame column of dtype string[python].
When I run
x['comment'].duplicated()
, I get the following output
1 False
2 False
3 False
4 False
...
155071 True
155072 True
155073 True
155074 True
155075 True
I then check the contents of those strings with x['comment'].iloc[155071:155075], which gives the following output
155071 Hola. @strange67 en mí debut, coincidió que se...
155072 Yo estuve altísima un tiempo ( la glico de 16 ...
155073 Hola strange67,yo cuando debute hace 32 anos s...
155074 No soy medico ni nada pero partiendo de eso, h...
Name: comment, dtype: string
As you can see, the strings are not duplicates of each other. So I rerun duplicated() on only these few rows: x['comment'].iloc[155071:155075].duplicated(). It now says that the contents are not duplicates
155071 False
155072 False
155073 False
155074 False
Name: comment, dtype: bool
Am I doing something wrong, or am I misinterpreting how duplicated() works? How do I find the duplicate records? drop_duplicates() removes a major chunk of my non-duplicate data.
CodePudding user response:
From the docs, pandas.Series.duplicated looks for duplicates anywhere in the column. You used the default keep="first" parameter, so the first occurrence is not flagged. Each of these strings is duplicated somewhere earlier in the column. When you call duplicated() on a slice, only the rows inside that slice are compared against each other, which is why the same rows come back False.
The same thing is happening in this simplified example
>>> df=pd.DataFrame({"a":[1,2,3,1,2,3,1,2,3]})
>>> df['a'].duplicated()
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 True
dtype: bool
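A short sketch on the same toy data showing both effects: slicing before calling duplicated() hides the earlier occurrences, and keep=False flags every member of a duplicate group.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3, 1, 2, 3]})

# Slicing first, then calling duplicated(), only compares rows
# inside the slice -- mirroring the iloc experiment in the question.
# Within rows 3..5 the values 1, 2, 3 each appear once, so all False.
print(df['a'].iloc[3:6].duplicated())

# keep=False flags every member of a duplicate group,
# including the first occurrences -- here, all True.
print(df['a'].duplicated(keep=False))
```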
You could do x['comment'] == 'Hola. @strange67 en mí debut, coincidió que se...'
to find all of the matches for a single value (note the displayed string is truncated, so you would need the full value for an exact comparison), or x['comment'].duplicated(keep=False)
to mark all of them.
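A sketch of how you might inspect the duplicate groups themselves, using made-up stand-in data for x['comment'] (the column name is from the question; the values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the real data
x = pd.DataFrame(
    {"comment": ["hola", "adios", "hola", "gracias", "hola", "adios"]}
)

# Mask of every row that belongs to a duplicate group
mask = x['comment'].duplicated(keep=False)

# View the duplicated rows side by side, grouped by value
dupes = x[mask].sort_values('comment')
print(dupes)

# Or count how many times each duplicated value occurs
counts = x['comment'].value_counts()
print(counts[counts > 1])
```

Sorting the masked rows puts identical comments next to each other, which makes it easy to eyeball whether they are genuine duplicates before calling drop_duplicates().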