I'm having a dataframe that looks like:
---- --------- ---------
| | Count | Value |
|---- --------- ---------|
| 0 | 10 | 0.5 |
| 1 | 17 | 0.9 |
| 2 | 56 | 0.6 |
| 3 | 25 | 0.7 |
| 4 | 80 | 0.7 |
| 5 | 190 | 0.6 |
| 6 | 3 | 0.8 |
| 7 | 60 | 0.5 |
---- --------- ---------
Now I want to filter. Smaller amounts of Count
require a higher Value
to get in focus.
The dependencies could look like: dict({100:0.5, 50:0.6, 40:0.7, 20:0.75, 10:0.8})
Examples:
- if
Count
is above100
,Value
requires only to be greater/equal0.5
- if
Count
is only10
to19
,Value
need to be greather/equal0.8
I could filter it easily with:
df[((df["Count"]>=100) & (df["Value"]>=0.5)) |
((df["Count"]>=50) & (df["Value"]>=0.6)) |
((df["Count"]>=40) & (df["Value"]>=0.7)) |
((df["Count"]>=20) & (df["Value"]>=0.75)) |
((df["Count"]>=10) & (df["Value"]>=0.8))]
---- --------- ---------
| | Count | Value |
|---- --------- ---------|
| 1 | 17 | 0.9 |
| 2 | 56 | 0.6 |
| 4 | 80 | 0.7 |
| 5 | 190 | 0.6 |
---- --------- ---------
But want to change periodically the thresholds (also adding or removing threshold steps) without constantly changing the filter. How could I do this in pandas?
MWE
import pandas as pd
df = pd.DataFrame({
"Count":[10,17,56,25,80,190,3,60],
"Value":[0.5,0.9,0.6,0.7,0.7,0.6,0.8,0.5]
})
limits = dict({100:0.5, 50:0.6, 40:0.7, 20:0.75, 10:0.8})
R equivalent
In R
I could solve a similar question with following code (thanks to akrun). But I don't know how to adapt to pandas.
library(data.table)
set.seed(33)
df = data.table(CPE=sample(1:500, 100),
PERC=runif(min = 0.1, max = 1, n=100))
lst1 <- list(c(20, 0.95), c(50, 0.9), c(100,0.85), c(250,0.8))
df[Reduce(`|`, lapply(lst1, \(x) CPE > x[1] & PERC > x[2]))]
CodePudding user response:
I would use pandas.cut
to perform the comparison in linear time. If you have many groups performing multiple comparisons will become inefficient (O(n*m) complexity):
# sorted bins and matching labels
bins = sorted(limits)
# [10, 20, 40, 50, 100]
labels = [limits[x] for x in bins]
# [0.8, 0.75, 0.7, 0.6, 0.5]
# mapping threshold from bins
s = pd.cut(df['Count'], bins=[0] bins [np.inf], labels=[np.inf] labels, right=False).astype(float)
out = df[df['Value'].ge(s)]
Output:
Count Value
1 17 0.9
2 56 0.6
4 80 0.7
5 190 0.6
Intermediate s
:
0 0.80
1 0.80
2 0.60
3 0.75
4 0.60
5 0.50
6 inf
7 0.60
Name: Count, dtype: float64
CodePudding user response:
Lets simplify your code by using boolean reduction with np.logical_or
. This is also very close to what your are trying to do in R
c = ['Count', 'Value']
df[np.logical_or.reduce([df[c].ge(t).all(1) for t in limits.items()])]
Count Value
1 17 0.9
2 56 0.6
4 80 0.7
5 190 0.6