Background: I am confused by my dataframe (df). When I run some simple analyses, it returns unexpected rows for one specific value in my column named 'ID' (specifically, when ID == 42), so I have started troubleshooting.
When I try to list all the rows where ID = 42, I do:
data = df.loc[df['ID'] == 42]
The rows look correct in this new variable called 'data'. However, when I scroll manually through the original dataframe df (e.g., in the Variable Explorer in Spyder), I can see many more rows with ID == 42 that are not included in 'data'.
Then, to check why the 'ID' values behave this way, I ran:
print(df['ID'].unique())
And, weirdly, I get this:
[ 20. 31. 42. 42. 84. 142. 198. 248. 280. 288. 352. 378. 459. 498.] -- note that 42 is repeated!
My question is, how can there be two 42s when I use the .unique() function? I thought it was supposed to output all the unique values? If I could understand this better, I could start to understand the rest of the problems that ensue...
Am I missing something about how 'unique' works?
P.S. My files are big so I haven't included them, but if I need to provide more (numerical) context, please let me know.
Thanks!
CodePudding user response:
Moving my comment to an answer, as it solved the problem. Cast the column to int before calling .unique():
print(df['ID'].astype(int).unique())
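The trailing dots in the printed output (20., 31., ...) suggest the 'ID' column is stored as floats. One plausible cause, which the original data would have to confirm, is that some IDs differ from 42 by a tiny fraction that the default array printing rounds away, so two distinct values both display as 42. Here is a minimal sketch of that situation; the dataframe and the value 42.000000001 are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data (an assumption, not the asker's file):
# the third ID is off from 42 by a tiny amount.
df = pd.DataFrame({"ID": [20.0, 42.0, 42.000000001, 84.0]})

uniques = df["ID"].unique()
print(uniques)                             # default printing may hide the difference
print([repr(float(v)) for v in uniques])   # repr shows each float in full

# Exact comparison only matches the row that is exactly 42.0
exact = df.loc[df["ID"] == 42]

# Tolerant comparison matches both rows that are approximately 42
close = df.loc[np.isclose(df["ID"], 42)]

# Rounding before the cast avoids truncating e.g. 41.999999999 down to 41
ids = df["ID"].round().astype(int)
print(ids.unique())
```

Note that astype(int) on its own truncates toward zero, so if any values sit slightly below the intended ID, rounding first (as in the last step) may be the safer choice.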