Pandas - List of Duplicates but Only Maintain Rows with Same Value Across All Duplicates


Working on some data cleaning here.

I'm having a tough time pulling out the duplicated rows where the value in column B is the same across every duplicated instance of column A. I want to keep those duplicates for now, and then drop them in a later step so I end up with one row for each col A / col B combo. For the moment, I just want the rows I'm interested in.

Example dataframe:

A     B
1     1
1     1
1     2
1     2
2     5
2     5
2     5

In the example above, I only want to keep the rows where A is 2, because B is 5 in every instance of 2. A value of 1 technically has duplicates too, but it carries two different B values (1 and 2) across its instances.

Example of what I'm trying to achieve:

A     B
2     5
2     5
2     5

Since I would also drop the duplicates after I get to that point, this is what I'd ideally end up with in case that changes the recommended solution:

A     B
2     5

Here is the code I'm using, which at least gets me the duplicated rows: updates_dupes = updates_df[updates_df.duplicated(["A", "B"])]
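
For reference, a minimal reconstruction of the example data (just an illustration, built to match the table above) shows why this only gets me partway: duplicated(["A", "B"]) flags every row that repeats an earlier (A, B) pair, so rows from the A = 1 group survive as well.

import pandas as pd

# Toy data matching the example above
updates_df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 2],
                           "B": [1, 1, 2, 2, 5, 5, 5]})

# duplicated marks each row that repeats an earlier (A, B) combination,
# so one (1, 1) row and one (1, 2) row are still included
updates_dupes = updates_df[updates_df.duplicated(["A", "B"])]
print(updates_dupes)
#    A  B
# 1  1  1
# 3  1  2
# 5  2  5
# 6  2  5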

Any thoughts?

CodePudding user response:

# Keep only the A-groups whose B values are all identical, then drop exact duplicates
new_df = df.groupby('A').filter(lambda group: group['B'].nunique() == 1).drop_duplicates()

Output:

>>> df
   A  B
0  1  1
1  1  1
2  1  2
3  1  2
4  2  5
5  2  5
6  2  5

>>> df.groupby('A').filter(lambda group: group['B'].nunique() == 1)
   A  B
4  2  5
5  2  5
6  2  5

>>> df.groupby('A').filter(lambda group: group['B'].nunique() == 1).drop_duplicates()
   A  B
4  2  5
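
If the lambda-based filter turns out to be slow on a larger frame, a vectorized sketch of the same idea (assuming the same df and column names as above) builds a boolean mask with transform instead of calling filter on each group:

# True for rows whose A-group contains exactly one distinct B value
mask = df.groupby('A')['B'].transform('nunique') == 1
new_df = df[mask].drop_duplicates()

print(new_df)
#    A  B
# 4  2  5

This keeps only the A = 2 rows, and drop_duplicates() then collapses them to the single (2, 5) row the question asks for.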