How can "unique" show duplicate values in a dataframe?

Time:04-12

Background: I am very confused by my dataframe (df). When I run some simple analyses, it produces seemingly random rows for one specific value in my column named 'ID' (specifically, when ID == 42). As a result, I have started to do some troubleshooting.

When I try to list all the rows where ID = 42, I do:

data=df.loc[df['ID'] == 42]

And the rows look correct in this new variable called 'data'. However, when I scroll manually through the original dataframe df (e.g., in the Variable Explorer in Spyder), I can see many more rows with ID=42 that do not end up in 'data'.

Then, to double check why the 'ID' values are showing this weird behavior, I did

print(df['ID'].unique())

And, weirdly, I get this:

[ 20. 31. 42. 42. 84. 142. 198. 248. 280. 288. 352. 378. 459. 498.] -- note that 42 is repeated!

My question is, how can there be two 42s when I use the .unique() function? I thought it was supposed to output all the unique values? If I could understand this better, I could start to understand the rest of the problems that ensue...

Am I missing something about how 'unique' works?

Ps. My files are big so I haven't included them, but if I need to provide more (numerical) context please let me know.

Thanks!

CodePudding user response:

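Moving my comment to an answer, as it solved the problem. The trailing dots in your output (20., 31., 42., ...) show that the column is stored as floats, so the two "42" entries are most likely slightly different values that print the same. Converting to int makes them equal: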

print(df['ID'].astype(int).unique())
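
To see why this can happen, here is a minimal sketch with made-up data (the values and the ID_int column name are hypothetical, not taken from the question). It assumes the stray rows hold a float that is very close to, but not exactly, 42. Note that astype(int) truncates toward zero, so a value like 41.9999999 would become 41; rounding first is the safer variant.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third value is not exactly 42, but prints like it.
df = pd.DataFrame({"ID": [20.0, 42.0, 42.0000000001, 84.0]})

# unique() compares exact float values, so both near-42 entries survive.
uniques = df["ID"].unique()
print(uniques)
print([repr(v) for v in uniques])  # repr exposes the digits hidden by the default print

# An exact comparison only matches the row that is exactly 42.0 ...
exact = df.loc[df["ID"] == 42]
# ... while a tolerance-based comparison also catches the near-42 row.
close = df.loc[np.isclose(df["ID"], 42)]
print(len(exact), len(close))

# Round before converting, because astype(int) alone truncates toward zero.
df["ID_int"] = df["ID"].round().astype(int)
print(df["ID_int"].unique())
```

If the IDs are meant to be whole numbers, converting the column once right after loading keeps later filters such as df['ID'] == 42 from silently missing rows.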