I have a dataset called df_authors and in that dataset I have a column called author. I have to verify that df_authors.author is a unique identifier.
What I tried: len(df_authors) == len(df_authors['author'].unique()), and this returns True.
My question is whether I have done this right. I found this line of code online and I'm not 100% sure it does what I think it does.
My understanding of a unique identifier is that every row in that column has a different value, and that this line of code checks each row against the rest of the dataset to see whether it is unique.
If someone could tell me whether I'm right or way off here, I'd greatly appreciate it. Thank you.
CodePudding user response:
Your understanding of a unique identifier is correct, but this line of code works a bit differently. It does not compare rows one by one; it compares two counts:
len(df_authors) gives you the number of rows in the DataFrame, and len(df_authors['author'].unique()) gives you the number of distinct values in the author column. If both lengths are the same, no value is repeated, which means that author is unique.
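As a minimal sketch of that comparison, here is the same check run on two small made-up DataFrames (the author names and the helper author_is_unique are hypothetical, for illustration only). pandas also provides Series.is_unique, which expresses the same test directly, and Series.duplicated, which shows the offending rows when the check fails:

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
df_unique = pd.DataFrame({"author": ["Austen", "Borges", "Calvino"]})
df_dupes = pd.DataFrame({"author": ["Austen", "Borges", "Austen"]})

def author_is_unique(df):
    # Same check as in the question: row count vs. number of distinct values
    return len(df) == len(df["author"].unique())

unique_result = author_is_unique(df_unique)  # no repeated author
dupes_result = author_is_unique(df_dupes)    # "Austen" appears twice

# Built-in equivalent of the length comparison
builtin_unique = df_unique["author"].is_unique
builtin_dupes = df_dupes["author"].is_unique

# Rows whose author value occurs more than once (keep=False marks every copy)
duplicated_rows = df_dupes[df_dupes["author"].duplicated(keep=False)]
print(duplicated_rows)
```

If the check returns False, inspecting duplicated_rows is usually the quickest way to see which values break uniqueness.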
You can also leverage pandas more directly by using set_index:
df_with_index = df_authors.set_index('author', verify_integrity=True)
If the index is not unique, that statement raises a ValueError (because of verify_integrity=True). As a bonus, you will be able to use the author as an index, e.g.:
df_with_index.loc[author]
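A short sketch of how that could look in practice, assuming made-up data with a hypothetical books column (neither the values nor that column come from the question). The ValueError is caught so that the duplicate case can be reported instead of stopping the script:

```python
import pandas as pd

# Hypothetical sample data; the "books" column is an assumption for illustration
df_dupes = pd.DataFrame({"author": ["Austen", "Borges", "Austen"], "books": [6, 3, 6]})
df_ok = pd.DataFrame({"author": ["Austen", "Borges"], "books": [6, 3]})

# With duplicates, verify_integrity=True makes set_index raise ValueError
try:
    df_dupes.set_index("author", verify_integrity=True)
    raised = False
except ValueError as exc:
    raised = True
    print(f"author is not a unique identifier: {exc}")

# Without duplicates, the call succeeds and author becomes the index
df_with_index = df_ok.set_index("author", verify_integrity=True)

# Label-based lookup by author
borges_books = df_with_index.loc["Borges", "books"]
print(borges_books)
```

Note that set_index returns a new DataFrame by default, so df_authors itself is left unchanged.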