How to calculate percentile-like mean, does it even have a real mathematical name?-CodePudding

Let S={s₀, s₁, ..., s_n} and s_i ∈ R

Median m of set S is also known as a 50-percentile because:

50% * card(s_i < m) = 50% * card(s_i > m)

If m is α-percentile then:

(100% - α) * card(s_i < m)) = α * card(s_i > m)

Assuming we want simmilar behaviour from mean we can say that: Mean m of set S is also known as a 50-percentile-like-mean because:

50% * sumⁿ_{i=0, s_i < m}( m - s_i ) = 50% * sumⁿ_{i=0, s_i > m}( m - s_i )

Then we can define α-percentile-like-mean to be:

(100% - α) * sumⁿ_{i=0, s_i < m}( m - s_i ) = α * sumⁿ_{i=0, s_i > m}( m - s_i )

My questions are:

Does percentile-like-mean have a real mathematical name ?
Can this equation be solved (find m for some given set S) in a linear time (like mean) ?
If not can this be solved using something faster than binary search for m (card(S) ~= 10⁷) ?

I have only managed to come up with binary search: Let a,b = min(S), max(S), let m = a=b/2, naively calculate both sides of the equation and adjust a=m or b=m depending on which side was greater.

Maybe the the binary search can be further optimised ? Like spliting range a-b with α and 1-α instead 50%-50%, or using the difference between both sides of the equation somehow ?

Best regards

CodePudding user response：

You have a mistake in your definition on the percentile-like-mean: one side is always negative, the other is always positive. To fix that you need to negate one of the sides:

(1 - α) Σ_{s_i<m} (m - s_i) = α Σ_{s_i>=m} (s_i - m)

Then it can be verified that it is equivalent with the arithmetic mean for α = 50%. I also include the s_i = m in the sum; this doesn't change the definition, but simplifies the math.

The above equation can be rewritten as:

m = ((1 - α) Σ_{s_i<m} s_i α Σ_{s_i>=m} s_i) / ((1 - α) Σ_{s_i<m} 1 α Σ_{s_i>=m} 1)

Assuming that s_i are sorted and s_j <= m < s_{j 1} for some j, it can be rewritten as:

m = ((1 - α) Σ_i<j s_i α Σ_i>=j s_i) / ((1 - α) j α (n - j))

What's left is to check if any j solves the equation, which can be done in linear time by maintaining the partial sums on the left and on the right:

double left = 0, right = 0;
for(int i = 0; i < n;   i)
    right  = s[i];

for(int j = 0; j 1 < n;   j) {
    left  = s[j];
    right -= s[j];
    double m = ((1-a)*left   a*right)/((1-a)*(j 1)   a*(n-j-1));
    if(s[j] <= m && m < s[j 1])
        return m;
}

CodePudding user response：

I've modified your definition a bit because of Yakov Galka's observations.

(100 - a) * sum_{s_i < m} (m - s_i) = a * sum_{s_i > m} (s_i - m)

This can be rearranged as follows,...

m((100 - a) * count_{s_i < m}   a * count_{s_i > m}) = (a * sum_{s_i > m} s_i)   ((100 - a) * sum_{s_i < m} s_i)

If the s_i are sorted in ascending order, then making a guess for count_{s_i < m} makes it easy to calculate m. It makes more sense to me to do an integer binary search on count_{s_i < m} to find a value of m that satisfies the equation. It isn't clear to me at this time that a solution will always exist or that a solution would be unique. I think this approach would be O(nlogn) so should be achievable in a reasonable time even if n approx 10^7.