I was trying to check a file containing duplicate data using pandas.
Student|"Details"
Joe|"December 2017|chemistry"
Bob|"April 2018|chemistry|Biology"
sam|"December 2018|physics"
I want to check whether any value in the second column (Details) is duplicated. If a value is duplicated, print every line that contains it. So here the output should be
Joe|"December 2017|chemistry"
Bob|"April 2018|chemistry|Biology"
CodePudding user response:
Split the Details column by '|', explode, check if each value is duplicated, group by the index and use max aggregation to create a boolean mask. Use this mask to filter.
# Flag rows that contain at least one duplicated pipe-separated value
mask = (df['Details'].str.split('|')
                     .explode()
                     .duplicated(keep=False)
                     .groupby(level=0).max())
df[mask]
[out]
Student Details
0 Joe December 2017|chemistry
1 Bob April 2018|chemistry|Biology
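For completeness, here is a self-contained sketch of the same approach. The DataFrame is built by hand from the sample rows (an assumption; loading the actual file would need the appropriate read_csv options for your delimiter and quoting):
import pandas as pd

# Reconstruct the sample data (hypothetical; replace with your own loading step)
df = pd.DataFrame({
    'Student': ['Joe', 'Bob', 'sam'],
    'Details': ['December 2017|chemistry',
                'April 2018|chemistry|Biology',
                'December 2018|physics'],
})

# One row per pipe-separated value, keeping the original row index,
# then flag values that occur more than once and collapse back per row
mask = (df['Details'].str.split('|')
                     .explode()
                     .duplicated(keep=False)
                     .groupby(level=0).max())

print(df[mask])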
CodePudding user response:
The pandas duplicated method should identify your duplicates:
df.duplicated(subset='Details')
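A minimal usage sketch (again building the DataFrame by hand as an assumption). Note that DataFrame.duplicated(subset='Details') flags rows whose whole Details string repeats, not the individual pipe-separated values inside it:
import pandas as pd

df = pd.DataFrame({
    'Student': ['Joe', 'Bob', 'sam'],
    'Details': ['December 2017|chemistry',
                'April 2018|chemistry|Biology',
                'December 2018|physics'],
})

# keep=False marks every occurrence of a repeated Details value
print(df[df.duplicated(subset='Details', keep=False)])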