In the dataframe below:
pd.DataFrame({'customer': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust4', 'cust4'],
'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2018, 2019],
'score': [0.10, 0.59, 0.3, 0.44, 0.2, 0.78, 0.6, 0.37, .023]})
customer year score
0 cust1 2017 0.100
1 cust1 2018 0.590
2 cust1 2019 0.300
3 cust2 2018 0.440
4 cust2 2019 0.200
5 cust3 2017 0.780
6 cust3 2018 0.600
7 cust4 2018 0.370
8 cust4 2019 0.023
I want to filter the data within each group of customers. the conditions are:
if the score >= 0.5: return only rows greater than 0.5 in that group
if no score is greater than 0.5 in a group: return all the rows
The result should look like this:
customer year cond
0 cust1 2018 0.590
1 cust2 2018 0.440
2 cust2 2019 0.200
3 cust3 2017 0.780
4 cust3 2018 0.600
5 cust4 2018 0.370
6 cust4 2019 0.023
CodePudding user response:
Chain 2 conditions - first mask is for test if greater or equal by Series.ge
and for second get all customer
if no match condition m
:
m = df['score'].ge(0.5)
df = df[m | ~df['customer'].isin(df.loc[m, 'customer'])]
print (df)
customer year score
1 cust1 2018 0.590
3 cust2 2018 0.440
4 cust2 2019 0.200
5 cust3 2017 0.780
6 cust3 2018 0.600
7 cust4 2018 0.370
8 cust4 2019 0.023
Details:
print (df.loc[m, 'customer'])
1 cust1
5 cust3
6 cust3
Name: customer, dtype: object
print (~df['customer'].isin(df.loc[m, 'customer']))
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 True
8 True
Name: customer, dtype: bool
Or if performance is no important use for second mask GroupBy.transform
with GroupBy.any
- should be slow in large DataFrames:
m = df['score'].ge(0.5)
df = df[m | ~m.groupby(df['customer']).transform('any')]
print (df)
customer year score
1 cust1 2018 0.590
3 cust2 2018 0.440
4 cust2 2019 0.200
5 cust3 2017 0.780
6 cust3 2018 0.600
7 cust4 2018 0.370
8 cust4 2019 0.023
Details:
print (~m.groupby(df['customer']).transform('any'))
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 True
8 True
Name: score, dtype: bool
CodePudding user response:
You can use two masks for boolean indexing:
# is the score ≥ 0.5?
m1 = df['score'].ge(0.5)
# are none of values in the group ≥ 0.5
m2 = ~m1.groupby(df['customer']).transform('any')
# select if any condition matches
out = df[m1|m2]
Output:
customer year score
1 cust1 2018 0.590
3 cust2 2018 0.440
4 cust2 2019 0.200
5 cust3 2017 0.780
6 cust3 2018 0.600
7 cust4 2018 0.370
8 cust4 2019 0.023