I have a data frame and around 10-12 columns. One of the column is the student number e.g. 1234567 and the other is an identifier e.g passport numbers, license number . How can I find that each student has a unique identifier. Like student 1234567 has identifier ABC5679K only. Also I want to store the students who are tagged with duplicate identifier. For e.g. If student 1234567 also has identifier ABC3408T, I want to know those.
CodePudding user response:
Assuming such input, where the first student has a unique other id, while the second has two different ids:
student_id other_id
0 1 a
1 1 a
2 2 b
3 2 c
you can use GroupBy.transform
with nunique
for that:
mask = df.groupby('student_id')['other_id'].transform('nunique').gt(1)
print(df[mask])
output:
student_id other_id
2 2 b
3 2 c
Or, for a list of all students with duplicated other ids:
s = df.groupby('student_id')['other_id'].nunique()
s[s>1].index.to_list()
output: [2]
CodePudding user response:
df.groupby(["student_name"])["passport_number"].nunique() > 1
You can use the groupby and nunique function to help you identify repeats. Hope this answer your question.