Working on some data cleaning here.
I'm having a tough time isolating duplicated rows where the value in column B is the same across all duplicated instances of a value in column A. I want to keep those duplicates for now, and then drop them in a later step so I end up with one row for each combination of column A and column B. For now, I just want the rows I'm interested in.
Example dataframe:
A B
1 1
1 1
1 2
1 2
2 5
2 5
2 5
In the example above, I only want to keep the rows with A = 2 and B = 5, because 5 is the same B value across all instances of 2. A = 1 technically has duplicates too, but it has two different B values across all of its instances (1 and 2).
Example of what I'm trying to achieve:
A B
2 5
2 5
2 5
Since I would also drop the duplicates after I get to that point, this is what I'd ideally end up with in case that changes the recommended solution:
A B
2 5
Here is the code I'm using, which at least got me this far (it flags every row whose A/B pair appeared earlier, but it doesn't distinguish groups like A = 1 that have more than one B value):
updates_dupes = updates_df[updates_df.duplicated(["A", "B"])]  # keeps repeat occurrences of each A/B pair
Any thoughts?
Answer:
You can use groupby with filter to keep only the A groups whose B values are all identical (nunique() == 1), then drop the duplicate rows:
new_df = df.groupby('A').filter(lambda group: group['B'].nunique() == 1).drop_duplicates()
Output:
>>> df
A B
0 1 1
1 1 1
2 1 2
3 1 2
4 2 5
5 2 5
6 2 5
>>> df.groupby('A').filter(lambda group: group['B'].nunique() == 1)
A B
4 2 5
5 2 5
6 2 5
>>> df.groupby('A').filter(lambda group: group['B'].nunique() == 1).drop_duplicates()
A B
4 2 5
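If the frame is large, a vectorized alternative to filter is to broadcast each group's distinct-B count back onto the rows with transform and mask directly. This is a sketch, not part of the original answer; df here just rebuilds the example frame from the question:

import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2],
                   'B': [1, 1, 2, 2, 5, 5, 5]})

# transform('nunique') returns one value per row (its group's count of
# distinct B values), so the comparison yields a boolean row mask.
mask = df.groupby('A')['B'].transform('nunique') == 1

kept = df[mask]                   # the three A=2 rows
result = kept.drop_duplicates()   # single row: A=2, B=5

transform avoids calling a Python lambda once per group, which tends to matter when there are many distinct values in A.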