Home > Blockchain >  Find duplicate rows in a group, where another row exists in the same group that matches some conditi
Find duplicate rows in a group, where another row exists in the same group that matches some conditi

Time:04-05

I want to find rows which are duplicates according to a certain subset of column values, where those rows also have a zero value in a certain column C and another row exists in the same group with a nonzero value for C.

Test data:

df = pd.DataFrame([
        # Should be marked as the combination is available in another
        # non-zero price.
        {
            'colour': 'blue',
            'flavour': 'strawberry',
            'price': 0
        },
        {
            'colour': 'blue',
            'flavour': 'strawberry',
            'price': 40,
        },
        {
            'colour': 'blue',
            'flavour': 'vanilla',
            'price': 50,
        },
        # Should not be marked as it's the only way to get this
        # combination.
        {
            'colour': 'red',
            'flavour': 'strawberry',
            'price': 0,
        }
    ]
)

Here the subset of columns is ['colour', 'flavour'] and the column is price.

What I've tried: I have a working version that uses apply() with a function, but I would like a vectorized solution.

What makes intuitive sense to me would be something like this:

nonzero_agg = df[df['price'] != 0].groupby(['colour', 'flavour']).agg({'price': 'max'})
df['marked'] = (df['price'] == 0) & (df[['colour', 'flavour']].isin(nonzero_agg))

But I think this use of isin() doesn't work, I get ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'.

I want the resulting dataframe to look like this:

  colour     flavour  price  marked
0   blue  strawberry      0    True
1   blue  strawberry     40   False
2   blue     vanilla     50   False
3    red  strawberry      0   False

CodePudding user response:

Test if price is 0 and at least one value non 0 per groups by GroupBy.transform with GroupBy.any and chain conditions:

m = df['price'] == 0

df['new'] = df.assign(tmp=~m).groupby(['colour', 'flavour'])['tmp'].transform('any') & m
print (df)
  colour     flavour  price    new
0   blue  strawberry      0   True
1   blue  strawberry     40  False
2   blue     vanilla     50  False
3    red  strawberry      0  False
  • Related