pandas - ranking with tolerance?


Is there a way to rank values in a dataframe while allowing for a tolerance?

Say I have the following values:

ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])

and if I run rank:

ex.rank(method='average') 
0    2.0 
1    3.0 
2    1.0 
3    6.0 
4    5.0
5    4.0 
dtype: float64

But what I'd like as a result would be (with a tolerance of 0.01):

0    2.0 
1    3.5 
2    1.0 
3    6.0 
4    5.0
5    3.5 

Any way to define this tolerance?

Thanks

CodePudding user response:

This function may work:

import pandas as pd

def rank_with_tolerance(sr, tolerance=0.01 + 1e-10, method='average'):
    # unique values, sorted and indexed by themselves so they can act as a mapper
    vals = pd.Series(sr.unique()).sort_values()
    vals.index = vals
    # snap each value onto its predecessor when the gap is within tolerance
    vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))

    # map the original values through the snapped mapper, then rank
    return sr.map(vals).fillna(sr).rank(method=method)

It works for your given input:

ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
rank_with_tolerance(ex, tolerance=0.01 + 1e-10, method='average')

# result:
0    2.0
1    3.5
2    1.0
3    6.0
4    5.0
5    3.5
dtype: float64
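
To see what the function does internally: the masking step builds an intermediate mapper in which each unique value is snapped onto its sorted predecessor whenever the gap is within tolerance. Reconstructing it for illustration, continuing from the snippet above:

vals = pd.Series(ex.unique()).sort_values()
vals.index = vals
vals.mask(vals - vals.shift(1) <= 0.01 + 1e-10, vals.shift(1))

# intermediate mapper:
16.15    16.15
16.52    16.52
19.95    19.95
19.96    19.95
20.53    20.53
22.77    22.77
dtype: float64

sr.map(vals) then replaces 19.96 with 19.95, which is why the two values share the average rank 3.5.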

And with more complex sets it seems to work too:

ex = pd.Series([16.52,19.95,19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01 + 1e-10, method='average')

# result:
0    1.0
1    3.0
2    3.0
3    3.0
4    5.5
5    5.5
6    7.0
dtype: float64
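
One subtlety worth noting (my observation, not part of the original answer): the mask compares each unique value with the original shifted series, so snapping is not transitive. In the run 19.95, 19.96, 19.97, 19.98 every consecutive gap is within tolerance, yet 19.97 snaps only one step, to 19.96, rather than all the way down to 19.95; that is why the two 19.97s rank at 5.5 instead of joining the 19.95 group:

vals = pd.Series(ex.unique()).sort_values()
vals.index = vals
vals.mask(vals - vals.shift(1) <= 0.01 + 1e-10, vals.shift(1)).to_dict()

# result:
{16.52: 16.52, 19.95: 19.95, 19.96: 19.95, 19.97: 19.96, 19.98: 19.97}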

CodePudding user response:

You could do some sort of min-max scaling, i.e.

ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])

# You scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))

# You put them on a scale from 1 to the length of your series
result = ex_scaled * len(ex) + 1


# result
0    1.335347
1    4.444109
2    1.000000
3    7.000000
4    4.969789
5    4.453172

That way you are still ranking, but values that are close to each other get scores that are close to each other.
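
For reuse, the same scaling can be wrapped in a small helper (a sketch; scaled_rank is a hypothetical name, and it uses the Series' own min()/max() methods instead of the builtins):

import pandas as pd

def scaled_rank(sr):
    # linear map onto [1, len(sr) + 1]; this yields continuous scores
    # rather than true ranks, but close values get close scores
    scaled = (sr - sr.min()) / (sr.max() - sr.min())
    return scaled * len(sr) + 1

scaled_rank(pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96]))

Note that this assumes the series is not constant, otherwise the denominator is zero.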

CodePudding user response:

You can sort the values, merge the close ones and rank on that:

s = ex.drop_duplicates().sort_values()
# start a new group whenever the gap to the previous sorted value exceeds
# the threshold, replace each value with its group mean, then restore the
# original order
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
           .transform('mean')
           .reindex_like(ex)
         )

out = mapper.rank(method='average')

N.B. I used 0.011 as the threshold since floating point arithmetic does not always provide enough precision to reliably detect a value close to the threshold.

output:

0    2.0
1    3.5
2    1.0
3    6.0
4    5.0
5    3.5
dtype: float64

intermediate mapper:

0    16.520
1    19.955
2    16.150
3    22.770
4    20.530
5    19.955
dtype: float64
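
A caveat (my note, not the answerer's): drop_duplicates plus reindex_like leaves NaN, and therefore NaN ranks, for rows whose value occurs more than once in the series. If exact duplicates can appear, a variant that sorts without deduplicating sidesteps this (a sketch; rank_groupwise is a hypothetical name wrapping the same groupby idea):

import pandas as pd

def rank_groupwise(sr, tol=0.011, method='average'):
    # sort, start a new group whenever the gap to the previous value
    # exceeds tol, replace each value by its group mean, then rank in
    # the original order (sort_values keeps index labels, so duplicates
    # survive the reindex)
    s = sr.sort_values()
    groups = s.diff().gt(tol).cumsum()
    return s.groupby(groups).transform('mean').reindex_like(sr).rank(method=method)

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
rank_groupwise(ex)  # same output as above

Note that merging here is transitive: a long run of values whose consecutive gaps all stay within tol collapses into a single group, even if the run's endpoints differ by more than tol.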