I have a DataFrame with two columns:
id Date
1 2022-01-01
1 2022-01-01
1 2022-01-01
1 2022-01-02
1 2022-01-02
1 2022-01-02
2 2022-01-01
2 2022-01-01
3 2022-01-01
3 2022-01-01
3 2022-01-01
3 2022-01-01
I need to find the duplicated rows (or duplicated IDs): rows where the values in all columns are the same and that appear more than twice.
The result should look like this:
id Date
1 2022-01-01
1 2022-01-02
3 2022-01-01
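For reference, a reproducible version of this sample data (assuming pandas is imported as pd, with Date kept as strings; the variable name df is my own choice):
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Date': ['2022-01-01', '2022-01-01', '2022-01-01',
             '2022-01-02', '2022-01-02', '2022-01-02',
             '2022-01-01', '2022-01-01',
             '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01'],
})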
CodePudding user response:
Here's one way to find rows that appear more than twice:
df2 = (df.assign(x=0)
         .groupby(list(df.columns))
         .count()
         .query('x > 2')
         .drop(columns='x')
         .reset_index())
Output:
id Date
0 1 2022-01-01
1 1 2022-01-02
2 3 2022-01-01
Explanation:
- add a temporary column x
- use groupby().count() to obtain the unique combinations of the id, Date key, with the number of occurrences per group in column x
- use query() to filter for rows that occur more than twice, using the condition x > 2
- drop the temporary column x and use reset_index() to restore id and Date as columns (an equivalent that skips the temporary column is sketched below).
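If you prefer to avoid the temporary column altogether, a minimal sketch using groupby().size() should be equivalent (the helper column name n is my own choice, not part of the original answer):
df2 = (df.groupby(list(df.columns))
         .size()                   # number of occurrences per (id, Date) pair
         .reset_index(name='n')    # group keys back to columns, counts in n
         .query('n > 2')           # keep pairs that appear more than twice
         .drop(columns='n'))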
CodePudding user response:
You can also use .value_counts() instead of groupby(), while the rest is similar to the answer of @constantstranger:
(df
 # count occurrences of each unique row
 .value_counts()
 # convert the Series to a DataFrame so the counts can be queried later on
 .to_frame()
 # move the MultiIndex back to columns and rename the counts column
 # (it is named 0 on older pandas versions, 'count' on pandas >= 2.0)
 .reset_index().rename(columns={0: 'count'})
 # keep rows that occur more than twice
 .query('count > 2')
 # remove the temporary column
 .drop(columns='count')
)
giving:
id Date
0 3 2022-01-01
1 1 2022-01-01
2 1 2022-01-02
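Note that the row order differs from the first answer because value_counts() sorts by count in descending order. On recent pandas versions the chain can also be shortened with Series.reset_index(name=...); a sketch under that assumption, using the same df as above:
(df.value_counts()                # count occurrences of each unique row
   .reset_index(name='count')     # MultiIndex back to columns, counts in 'count'
   .query('count > 2')            # keep rows that occur more than twice
   .drop(columns='count'))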