I want to find rows which are duplicates according to a certain subset of column values, where those rows also have a zero value in a certain column C and another row exists in the same group with a nonzero value for C.
Test data:
df = pd.DataFrame([
# Should be marked as the combination is available in another
# non-zero price.
{
'colour': 'blue',
'flavour': 'strawberry',
'price': 0
},
{
'colour': 'blue',
'flavour': 'strawberry',
'price': 40,
},
{
'colour': 'blue',
'flavour': 'vanilla',
'price': 50,
},
# Should not be marked as it's the only way to get this
# combination.
{
'colour': 'red',
'flavour': 'strawberry',
'price': 0,
}
]
)
Here the subset of columns is ['colour', 'flavour']
and the column is price
.
What I've tried: I have a working version that uses apply()
with a function, but I would like a vectorized solution.
What makes intuitive sense to me would be something like this:
nonzero_agg = df[df['price'] != 0].groupby(['colour', 'flavour']).agg({'price': 'max'})
df['marked'] = (df['price'] == 0) & (df[['colour', 'flavour']].isin(nonzero_agg))
But I think this use of isin()
doesn't work, I get ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
.
I want the resulting dataframe to look like this:
colour flavour price marked
0 blue strawberry 0 True
1 blue strawberry 40 False
2 blue vanilla 50 False
3 red strawberry 0 False
CodePudding user response:
Test if price
is 0
and at least one value non 0
per groups by GroupBy.transform
with GroupBy.any
and chain conditions:
m = df['price'] == 0
df['new'] = df.assign(tmp=~m).groupby(['colour', 'flavour'])['tmp'].transform('any') & m
print (df)
colour flavour price new
0 blue strawberry 0 True
1 blue strawberry 40 False
2 blue vanilla 50 False
3 red strawberry 0 False