Home > Blockchain >  Filter dataframe on two interdependent columns
Filter dataframe on two interdependent columns

Time:03-24

We have a minimum example dataframe:

    data = [[1, 100],
            [1, 100],
            [2, 100],
            [2, 101],
            [3, 100],
            [4, 101],
            [4, 101]
           ]
    
    pd.DataFrame(data, columns=['A', 'B'])

I want to filter for col A where it is duplicated but only where col B is not duplicated for that subset.

For this dataframe the output I would expect is:

    [[2, 100],
     [2, 101]
    ]

I tried using a df.duplicated with other functions (making it easy to drop records like the single 3) but although I can meet the first part I'm not sure how to then do a windowed filter to further filter the data.

This is a minimum example but there are millions of records which is why i was looking for a vectorised solution.

CodePudding user response:

Another solution:

output = df[df['B'].groupby(df['A']).transform('nunique') > 1]

CodePudding user response:

You can use a combination of groupby and duplicated:

filtered = df[df['A'].duplicated(keep=False) & df.groupby('A')['B'].apply(lambda x: ~x.duplicated().any())]

Output:

>>> filtered
   A    B
2  2  100
3  2  101

Alternate solution:

filtered = df[df.groupby('A', as_index=False).apply(lambda g: g.apply(lambda col: col.duplicated(keep=False))).sum(axis=1) == 1]
  • Related