Home > Mobile >  How to calculate percentile-like mean, does it even have a real mathematical name?
How to calculate percentile-like mean, does it even have a real mathematical name?

Time:12-29

Let S={s0, s1, ..., sn} and siR

Median m of set S is also known as a 50-percentile because:

50% * card(si < m) = 50% * card(si > m)

If m is α-percentile then:

(100% - α) * card(si < m)) = α * card(si > m)

Assuming we want simmilar behaviour from mean we can say that: Mean m of set S is also known as a 50-percentile-like-mean because:

50% * sumni=0, si < m ( m - si ) = 50% * sumni=0, si > m ( m - si )

Then we can define α-percentile-like-mean to be:

(100% - α) * sumni=0, si < m ( m - si ) = α * sumni=0, si > m ( m - si )

My questions are:

  1. Does percentile-like-mean have a real mathematical name ?
  2. Can this equation be solved (find m for some given set S) in a linear time (like mean) ?
  3. If not can this be solved using something faster than binary search for m (card(S) ~= 107) ?

I have only managed to come up with binary search: Let a,b = min(S), max(S), let m = a=b/2, naively calculate both sides of the equation and adjust a=m or b=m depending on which side was greater.

Maybe the the binary search can be further optimised ? Like spliting range a-b with α and 1-α instead 50%-50%, or using the difference between both sides of the equation somehow ?

Best regards

CodePudding user response:

You have a mistake in your definition on the percentile-like-mean: one side is always negative, the other is always positive. To fix that you need to negate one of the sides:

(1 - α) Σsi<m (m - si) = α Σsi>=m (si - m)

Then it can be verified that it is equivalent with the arithmetic mean for α = 50%. I also include the si = m in the sum; this doesn't change the definition, but simplifies the math.

The above equation can be rewritten as:

m = ((1 - α) Σsi<m si α Σsi>=m si) / ((1 - α) Σsi<m 1 α Σsi>=m 1)

Assuming that si are sorted and sj <= m < sj 1 for some j, it can be rewritten as:

m = ((1 - α) Σi<j si α Σi>=j si) / ((1 - α) j α (n - j))

What's left is to check if any j solves the equation, which can be done in linear time by maintaining the partial sums on the left and on the right:

double left = 0, right = 0;
for(int i = 0; i < n;   i)
    right  = s[i];

for(int j = 0; j 1 < n;   j) {
    left  = s[j];
    right -= s[j];
    double m = ((1-a)*left   a*right)/((1-a)*(j 1)   a*(n-j-1));
    if(s[j] <= m && m < s[j 1])
        return m;
}

CodePudding user response:

I've modified your definition a bit because of Yakov Galka's observations.

(100 - a) * sum_{s_i < m} (m - s_i) = a * sum_{s_i > m} (s_i - m)

This can be rearranged as follows,...

m((100 - a) * count_{s_i < m}   a * count_{s_i > m}) = (a * sum_{s_i > m} s_i)   ((100 - a) * sum_{s_i < m} s_i)

If the s_i are sorted in ascending order, then making a guess for count_{s_i < m} makes it easy to calculate m. It makes more sense to me to do an integer binary search on count_{s_i < m} to find a value of m that satisfies the equation. It isn't clear to me at this time that a solution will always exist or that a solution would be unique. I think this approach would be O(nlogn) so should be achievable in a reasonable time even if n approx 10^7.

  • Related