Python finding duplicates more than 2 occurrences in all columns


I have a dataframe with two columns:

id  Date
1   2022-01-01
1   2022-01-01
1   2022-01-01
1   2022-01-02
1   2022-01-02
1   2022-01-02
2   2022-01-01
2   2022-01-01
3   2022-01-01
3   2022-01-01
3   2022-01-01
3   2022-01-01

I need to find the duplicated rows (or duplicated ids), i.e. rows that have the same values in all columns and appear more than twice.

The result should look like this:

id  Date
1   2022-01-01
1   2022-01-02
3   2022-01-01

CodePudding user response:

Here's a way to find rows duplicated more than twice:

df2 = (
    df.assign(x=0)
      .groupby(list(df.columns))
      .count()
      .query('x > 2')
      .drop(columns='x')
      .reset_index()
)

Output:

   id        Date
0   1  2022-01-01
1   1  2022-01-02
2   3  2022-01-01

Explanation:

  • add a temporary column x
  • use groupby().count() to get each unique (id, Date) combination together with its number of occurrences per group in column x
  • use query() with the condition x > 2 to keep only combinations that occur more than twice
  • drop the temporary column x and use reset_index() to restore id and Date as regular columns (a self-contained variant is sketched below).
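
For completeness, here is a self-contained sketch of the same idea (my own variant, not the answer verbatim): it builds the sample frame from the question and uses groupby().size(), which avoids the temporary column. It assumes a reasonably recent pandas.

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Date': ['2022-01-01', '2022-01-01', '2022-01-01',
             '2022-01-02', '2022-01-02', '2022-01-02',
             '2022-01-01', '2022-01-01',
             '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01'],
})

# size() counts rows per (id, Date) group, so no helper column is needed
df2 = (
    df.groupby(list(df.columns))
      .size()                      # number of occurrences per group
      .reset_index(name='n')       # back to columns, counts in 'n'
      .query('n > 2')              # keep groups appearing more than twice
      .drop(columns='n')
      .reset_index(drop=True)
)
print(df2)
#    id        Date
# 0   1  2022-01-01
# 1   1  2022-01-02
# 2   3  2022-01-01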

CodePudding user response:

You can also use .value_counts instead of groupby; the rest is similar to @constantstranger's answer:

(df
   # count occurrences of each unique (id, Date) row
   .value_counts()

   # convert to a DataFrame so the counts can be queried later on
   .to_frame()

   # move the MultiIndex back to columns and name the count column
   .reset_index().rename(columns={0: 'count'})

   # keep only rows that occur more than twice
   .query('count > 2')

   # remove the temporary count column
   .drop(columns='count')
)

giving

   id        Date
0   3  2022-01-01
1   1  2022-01-01
2   1  2022-01-02
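
As a small variant (my own note, not part of the original answer), Series.reset_index accepts a name= argument, so the .to_frame()/.rename() pair can be collapsed. This assumes pandas 1.1+ for DataFrame.value_counts:

out = (
    df.value_counts()               # per-row occurrence counts (pandas >= 1.1)
      .reset_index(name='count')    # back to columns, counts in 'count'
      .query('count > 2')           # keep rows occurring more than twice
      .drop(columns='count')
)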