pandas - ranking with tolerance?

Is there a way to rank values in a DataFrame while allowing for a tolerance?

Say I have the following values

ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])

and if I ran rank:

ex.rank(method='average') 
0    2.0 
1    3.0 
2    1.0 
3    6.0 
4    5.0
5    4.0 
dtype: float64

But what I'd like as a result would be (with a tolerance of 0.01):

Any way to define this tolerance?

Thanks

CodePudding user response：

This function may work:

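import pandas as pd

def rank_with_tolerance(sr, tolerance=0.01 + 1e-10, method='average'):
    # Sorted unique values, indexed by themselves so they act as a lookup table
    vals = pd.Series(sr.unique()).sort_values()
    vals.index = vals
    # Where a value is within `tolerance` of the previous one, use the previous value instead
    vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))

    # Map each original value to its replacement, then rank
    return sr.map(vals).fillna(sr).rank(method=method)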

It works for your given input:

ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
rank_with_tolerance(ex, tolerance=0.01 + 1e-10, method='average')

# result:
0    2.0
1    3.5
2    1.0
3    6.0
4    5.0
5    3.5
dtype: float64

And with more complex sets it seems to work too:

ex = pd.Series([16.52,19.95,19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01 + 1e-10, method='average')

# result:
0    1.0
1    3.0
2    3.0
3    3.0
4    5.5
5    5.5
6    7.0
dtype: float64
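
A note on the 1e-10 in the tolerance: it is there because the difference between two decimals such as 19.96 and 19.95 is generally not stored as exactly 0.01 in binary floating point, so a plain `<= 0.01` comparison can go either way. A minimal sketch of that check (the names `diff`, `eps` and `within_padded` are only for illustration):

```python
import pandas as pd

vals = pd.Series([19.95, 19.96])
diff = vals.diff().iloc[1]   # the difference as actually computed in floating point

eps = 1e-10                  # illustrative padding, same size as in the function above
print(repr(diff))            # typically not printed as exactly 0.01
within_padded = diff <= 0.01 + eps
print(within_padded)
```

The padding only needs to be much larger than the rounding error and much smaller than the spacing of your data; 1e-10 is one reasonable choice for values with two decimals, not a universal constant.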

CodePudding user response：

You could do some sort of min-max scaling, i.e.:

ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])

# You scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))

# You put them on a scale from 1 to len(ex) + 1
result = ex_scaled * len(ex) + 1


# result
0    1.335347
1    4.444109
2    1.000000
3    7.000000
4    4.969789
5    4.453172

That way you are still ranking, but values that are close to each other get ranks that are close to each other.
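
To make the difference from a tolerance-based rank concrete, here is a small sketch of the same scaling wrapped in a function (`minmax_scores` is a hypothetical name, not a pandas function). It shows that 19.95 and 19.96 get nearby scores rather than one shared rank:

```python
import pandas as pd

def minmax_scores(sr):
    # Scale to [0, 1], then stretch to [1, len(sr) + 1] as in the answer above.
    scaled = (sr - sr.min()) / (sr.max() - sr.min())
    return scaled * len(sr) + 1

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
scores = minmax_scores(ex)

# 19.95 (index 1) and 19.96 (index 5) end up close together but not tied.
gap = abs(scores[5] - scores[1])
print(scores)
print(gap)
```

So if you need actual ties (e.g. 3.5 and 3.5), this approach alone will not produce them; it only keeps the scores proportional to the distances between values.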

CodePudding user response：

You can sort the values, merge the close ones and rank on that:

s = ex.drop_duplicates().sort_values()
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
           .transform('mean')
           .reindex_like(ex)
         )

out = mapper.rank(method='average')

N.B. I used 0.011 as the threshold because floating-point arithmetic is not always precise enough to detect a value close to the threshold.

output:

0    2.0
1    3.5
2    1.0
3    6.0
4    5.0
5    3.5
dtype: float64

intermediate mapper:

0    16.520
1    19.955
2    16.150
3    22.770
4    20.530
5    19.955
dtype: float64
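
One caveat with the snippet above: `drop_duplicates()` removes the index labels of repeated values, so `reindex_like(ex)` would leave NaN at those positions if `ex` contained duplicates. A possible variant, sketched here under the assumption that you want duplicates ranked as well, maps by value instead (`rank_merge_close` and its `threshold` parameter are hypothetical names):

```python
import pandas as pd

def rank_merge_close(sr, threshold=0.011, method='average'):
    # Sorted distinct values; a new group starts wherever the gap to the
    # previous value exceeds the threshold.
    s = pd.Series(sr.unique()).sort_values()
    groups = s.diff().gt(threshold).cumsum()
    # Replace each value by the mean of its group, keyed by the original value.
    merged = s.groupby(groups).transform('mean')
    merged.index = s.values
    # Map back onto the input (duplicates included) and rank.
    return sr.map(merged).rank(method=method)

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
out = rank_merge_close(ex)
print(out)
```

Keep in mind that with this kind of grouping, a run of values that are each within the threshold of their neighbor is merged into a single group, even if the two ends of the run are further apart than the threshold.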