Filter according dynamic conditions that are dependent-CodePudding

I'm having a dataframe that looks like:

 ---- --------- --------- 
|    |   Count |   Value |
|---- --------- ---------|
|  0 |      10 |     0.5 |
|  1 |      17 |     0.9 |
|  2 |      56 |     0.6 |
|  3 |      25 |     0.7 |
|  4 |      80 |     0.7 |
|  5 |     190 |     0.6 |
|  6 |       3 |     0.8 |
|  7 |      60 |     0.5 |
 ---- --------- ---------

Now I want to filter. Smaller amounts of Count require a higher Value to get in focus.

The dependencies could look like: dict({100:0.5, 50:0.6, 40:0.7, 20:0.75, 10:0.8})
Examples:

if Count is above 100, Value requires only to be greater/equal 0.5
if Count is only 10 to 19, Value need to be greather/equal 0.8

I could filter it easily with:

df[((df["Count"]>=100) & (df["Value"]>=0.5)) |
   ((df["Count"]>=50) & (df["Value"]>=0.6)) |
   ((df["Count"]>=40) & (df["Value"]>=0.7)) |
   ((df["Count"]>=20) & (df["Value"]>=0.75)) |
   ((df["Count"]>=10) & (df["Value"]>=0.8))]

 ---- --------- --------- 
|    |   Count |   Value |
|---- --------- ---------|
|  1 |      17 |     0.9 |
|  2 |      56 |     0.6 |
|  4 |      80 |     0.7 |
|  5 |     190 |     0.6 |
 ---- --------- ---------

But want to change periodically the thresholds (also adding or removing threshold steps) without constantly changing the filter. How could I do this in pandas?

MWE

import pandas as pd

df = pd.DataFrame({
    "Count":[10,17,56,25,80,190,3,60],
    "Value":[0.5,0.9,0.6,0.7,0.7,0.6,0.8,0.5]
})

limits = dict({100:0.5, 50:0.6, 40:0.7, 20:0.75, 10:0.8})

R equivalent

In R I could solve a similar question with following code (thanks to akrun). But I don't know how to adapt to pandas.

library(data.table)

set.seed(33)

df = data.table(CPE=sample(1:500, 100), 
                PERC=runif(min = 0.1, max = 1, n=100))    
lst1 <- list(c(20, 0.95), c(50, 0.9), c(100,0.85), c(250,0.8))

df[Reduce(`|`, lapply(lst1, \(x) CPE > x[1] & PERC > x[2]))]

CodePudding user response：

I would use pandas.cut to perform the comparison in linear time. If you have many groups performing multiple comparisons will become inefficient (O(n*m) complexity):

# sorted bins and matching labels
bins = sorted(limits)
# [10, 20, 40, 50, 100]
labels = [limits[x] for x in bins]
# [0.8, 0.75, 0.7, 0.6, 0.5]

# mapping threshold from bins 
s = pd.cut(df['Count'], bins=[0] bins [np.inf], labels=[np.inf] labels, right=False).astype(float)

out = df[df['Value'].ge(s)]

Output:

   Count  Value
1     17    0.9
2     56    0.6
4     80    0.7
5    190    0.6

Intermediate s:

0    0.80
1    0.80
2    0.60
3    0.75
4    0.60
5    0.50
6     inf
7    0.60
Name: Count, dtype: float64

CodePudding user response：

Lets simplify your code by using boolean reduction with np.logical_or. This is also very close to what your are trying to do in R

c = ['Count', 'Value']
df[np.logical_or.reduce([df[c].ge(t).all(1) for t in limits.items()])]

   Count  Value
1     17    0.9
2     56    0.6
4     80    0.7
5    190    0.6