I have a CSV file that is around 50 MB in size. This is what a sample of the DataFrame looks like:
TITLE DESCRIPTION
android [email protected]
python [email protected]
android [email protected]
android [email protected]
Php [email protected]
I am trying to remove duplicates from the DataFrame only when a TITLE occurs more than 2 times, so the output
will look like this:
TITLE DESCRIPTION
android [email protected]
python [email protected]
android [email protected]
Php [email protected]
This is the code I am using right now:
df.drop_duplicates(subset='TITLE', inplace=True, keep=False)
But the issue is that it removes all the duplicates and keeps only a single
occurrence of each TITLE. How can I make it remove rows only when a title has occurred more than 2 times?
Please post the solution. Thanks.
CodePudding user response:
Let us try groupby
cumcount
to create a sequential counter per TITLE
, then select the rows where the value of this counter is less than or equal to 1, i.e. keep at most the first two occurrences of each TITLE:
df[df.groupby('TITLE').cumcount().le(1)]
TITLE DESCRIPTION
0 android [email protected]
1 python [email protected]
2 android [email protected]
4 Php [email protected]
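For completeness, here is a minimal, self-contained sketch of the same approach. The column names TITLE and DESCRIPTION come from the question; the DESCRIPTION values are placeholders, since the original emails are redacted:

import pandas as pd

# Sample data mirroring the question's layout; DESCRIPTION values are
# placeholders because the original emails are redacted.
df = pd.DataFrame({
    'TITLE': ['android', 'python', 'android', 'android', 'Php'],
    'DESCRIPTION': ['[email protected]'] * 5,
})

# cumcount() numbers the rows within each TITLE group starting at 0,
# so keeping rows where the counter is <= 1 keeps at most two rows per TITLE.
out = df[df.groupby('TITLE').cumcount().le(1)]
print(out)

Because cumcount starts at 0 for the first row of each group, .le(1) selects the first and second occurrence of every TITLE and drops the third and later ones, which matches the desired output above.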