We have a minimal example dataframe:
import pandas as pd

data = [[1, 100],
        [1, 100],
        [2, 100],
        [2, 101],
        [3, 100],
        [4, 101],
        [4, 101]]
df = pd.DataFrame(data, columns=['A', 'B'])
I want to keep the rows where col A is duplicated, but only where col B is not also duplicated within that subset.
For this dataframe the output I would expect is:
[[2, 100],
 [2, 101]]
I tried df.duplicated in combination with other functions (which makes it easy to drop records like the single 3), and while I can meet the first part, I'm not sure how to then apply a windowed filter within each group to filter the data further.
This is a minimal example, but the real data has millions of records, which is why I was looking for a vectorised solution.
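For reference, the first part on its own can be written like this (a sketch of the duplicated-based attempt, not the exact code I had):

# keep rows whose A value occurs more than once; this drops the single A=3 row
partial = df[df['A'].duplicated(keep=False)]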
CodePudding user response:
Another solution:
output = df[df['B'].groupby(df['A']).transform('nunique') > 1]
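One caveat, assuming groups can have more than two rows (which the sample doesn't show): nunique > 1 keeps any group with at least two distinct B values, even if some B values still repeat inside it, e.g. a group [5, 100], [5, 100], [5, 101]. If the requirement is strictly "no duplicated B within the group", compare the distinct count to the group size instead:

grp = df.groupby('A')['B']
# keep groups with more than one row in which every B value is distinct
output = df[(grp.transform('size') > 1) & (grp.transform('nunique') == grp.transform('size'))]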
CodePudding user response:
You can use a combination of groupby and duplicated:
filtered = df[df['A'].duplicated(keep=False)
              & df.groupby('A')['B'].transform(lambda x: ~x.duplicated().any())]
Output:
>>> filtered
   A    B
2  2  100
3  2  101
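Note the use of transform rather than apply: transform broadcasts the single per-group boolean back to every row, so the mask is index-aligned with df, whereas apply would return one value per group key and would not line up with the original rows:

df.groupby('A')['B'].transform(lambda x: ~x.duplicated().any())
# 0    False
# 1    False
# 2     True
# 3     True
# 4     True   <- the single A=3 row has no duplicated B, which is why
# 5    False      the A-duplicated condition is also needed in the mask
# 6    False
# Name: B, dtype: bool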
Alternate solution:
filtered = df[
    df.groupby('A', as_index=False)
      .apply(lambda g: g.apply(lambda col: col.duplicated(keep=False)))
      .sum(axis=1)
    == 1
]
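Since the question mentions millions of records, one practical note (a rough guideline, not benchmarked against the real data): the nested apply above runs a Python lambda per group and per column, while the transform('nunique') mask from the first answer stays on pandas' built-in aggregation paths and should scale much better. A minimal timing sketch on synthetic data (the sizes and value ranges here are made up):

import time
import numpy as np
import pandas as pd

# synthetic frame with many small groups, roughly mimicking the real scale
rng = np.random.default_rng(0)
big = pd.DataFrame({'A': rng.integers(0, 1_000_000, size=5_000_000),
                    'B': rng.integers(100, 103, size=5_000_000)})

t0 = time.perf_counter()
out = big[big['B'].groupby(big['A']).transform('nunique') > 1]
print(f"{time.perf_counter() - t0:.2f}s, {len(out):,} rows kept")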