I have a CSV file that is around 50 MB in size. This is what a sample of the DataFrame looks like:
TITLE DESCRIPTION
android [email protected]
python [email protected]
android [email protected]
android [email protected]
Php [email protected]
I am trying to remove duplicates from the DataFrame only when a TITLE occurs more than 2 times, so the output
will look like this:
TITLE DESCRIPTION
android [email protected]
python [email protected]
android [email protected]
Php [email protected]
This is the code I am using right now:
df.drop_duplicates(subset='TITLE', inplace=True, keep=False)
But the issue is that it removes all the duplicates and keeps only a single
occurrence of each TITLE. How can I make it remove rows only when a title has occurred more than 2 times?
Please post the solution. Thanks.
CodePudding user response:
Let us try groupby
cumcount
to create a sequential counter per TITLE
, then select the rows where the value of this counter is less than or equal to 1, i.e. keep at most the first two occurrences of each TITLE:
df[df.groupby('TITLE').cumcount().le(1)]
TITLE DESCRIPTION
0 android [email protected]
1 python [email protected]
2 android [email protected]
4 Php [email protected]
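For completeness, here is a minimal, self-contained sketch of the same approach. The column names TITLE and DESCRIPTION come from the question; the DESCRIPTION values are placeholders, since the original emails are redacted:

import pandas as pd

# Sample data mirroring the question's layout; DESCRIPTION values are
# placeholders because the original emails are redacted.
df = pd.DataFrame({
    'TITLE': ['android', 'python', 'android', 'android', 'Php'],
    'DESCRIPTION': ['[email protected]'] * 5,
})

# cumcount() numbers the rows within each TITLE group starting at 0,
# so keeping rows where the counter is <= 1 keeps at most two rows per TITLE.
out = df[df.groupby('TITLE').cumcount().le(1)]
print(out)

Because cumcount starts at 0 for the first row of each group, .le(1) selects the first and second occurrence of every TITLE and drops the third and later ones, which matches the desired output above.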