How to select all rows if no conditions are met within a group, and a subset of rows if certain conditions are met


In the DataFrame below:

import pandas as pd

df = pd.DataFrame({'customer': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust4', 'cust4'],
                   'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2018, 2019],
                   'score': [0.10, 0.59, 0.3, 0.44, 0.2, 0.78, 0.6, 0.37, 0.023]})

    customer    year    score
0   cust1   2017    0.100
1   cust1   2018    0.590
2   cust1   2019    0.300
3   cust2   2018    0.440
4   cust2   2019    0.200
5   cust3   2017    0.780
6   cust3   2018    0.600
7   cust4   2018    0.370
8   cust4   2019    0.023

I want to filter the data within each group of customers. The conditions are:

if any score in a group is >= 0.5: return only the rows with score >= 0.5 in that group
if no score in a group is >= 0.5: return all the rows of that group

The result should look like this:

    customer    year    score
0   cust1   2018    0.590
1   cust2   2018    0.440
2   cust2   2019    0.200
3   cust3   2017    0.780
4   cust3   2018    0.600
5   cust4   2018    0.370
6   cust4   2019    0.023

CodePudding user response:

Chain two conditions: the first mask m tests whether score is greater than or equal to 0.5 with Series.ge; the second keeps every row of any customer that has no row matching m:

m = df['score'].ge(0.5)
df = df[m | ~df['customer'].isin(df.loc[m, 'customer'])]
print(df)
  customer  year  score
1    cust1  2018  0.590
3    cust2  2018  0.440
4    cust2  2019  0.200
5    cust3  2017  0.780
6    cust3  2018  0.600
7    cust4  2018  0.370
8    cust4  2019  0.023

Details:

print(df.loc[m, 'customer'])
1    cust1
5    cust3
6    cust3
Name: customer, dtype: object

print(~df['customer'].isin(df.loc[m, 'customer']))
0    False
1    False
2    False
3     True
4     True
5    False
6    False
7     True
8     True
Name: customer, dtype: bool
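As a cross-check, the same per-group rule can be stated directly with GroupBy.apply: keep the qualifying rows if the group has any, otherwise keep the whole group. This is a sketch for clarity, not one of the original answers, and it is slower than the mask-based solutions:

```python
import pandas as pd

df = pd.DataFrame({'customer': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2',
                                'cust3', 'cust3', 'cust4', 'cust4'],
                   'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2018, 2019],
                   'score': [0.10, 0.59, 0.3, 0.44, 0.2, 0.78, 0.6, 0.37, 0.023]})

# Per group: keep rows with score >= 0.5; if none qualify, keep the whole group.
# group_keys=False preserves the original row index in the result.
out = (df.groupby('customer', group_keys=False)
         .apply(lambda g: g[g['score'].ge(0.5)] if g['score'].ge(0.5).any() else g))
print(out)
```

It returns the same seven rows as the mask approach, at the cost of a Python-level function call per group.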

Or, if performance is not critical, build the second mask with GroupBy.transform and any; this can be slow on large DataFrames:

m = df['score'].ge(0.5)
df = df[m | ~m.groupby(df['customer']).transform('any')]
print(df)
  customer  year  score
1    cust1  2018  0.590
3    cust2  2018  0.440
4    cust2  2019  0.200
5    cust3  2017  0.780
6    cust3  2018  0.600
7    cust4  2018  0.370
8    cust4  2019  0.023

Details:

print(~m.groupby(df['customer']).transform('any'))
0    False
1    False
2    False
3     True
4     True
5    False
6    False
7     True
8     True
Name: score, dtype: bool

CodePudding user response:

You can use two masks for boolean indexing:

# is the score ≥ 0.5?
m1 = df['score'].ge(0.5)
# are none of the values in the group ≥ 0.5?
m2 = ~m1.groupby(df['customer']).transform('any')

# select rows where either condition matches
out = df[m1 | m2]

Output:

  customer  year  score
1    cust1  2018  0.590
3    cust2  2018  0.440
4    cust2  2019  0.200
5    cust3  2017  0.780
6    cust3  2018  0.600
7    cust4  2018  0.370
8    cust4  2019  0.023
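A further variant of the two-mask idea, not from the original answers: instead of negating an 'any' mask, compare each group's maximum score against the threshold with GroupBy.transform('max'). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'customer': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2',
                                'cust3', 'cust3', 'cust4', 'cust4'],
                   'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2018, 2019],
                   'score': [0.10, 0.59, 0.3, 0.44, 0.2, 0.78, 0.6, 0.37, 0.023]})

# m1: the row itself qualifies.
m1 = df['score'].ge(0.5)
# m2: the group's best score is below 0.5, so no row in the group
# qualifies and the whole group should be kept.
m2 = df.groupby('customer')['score'].transform('max').lt(0.5)
out = df[m1 | m2]
print(out)
```

This gives the same result because "no value in the group is ≥ 0.5" is equivalent to "the group maximum is < 0.5".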