Let S={s0, s1, ..., sn} and si ∈ R
Median m of set S is also known as a 50-percentile because:
50% * card(si < m) = 50% * card(si > m)
If m is α-percentile then:
(100% - α) * card(si < m)) = α * card(si > m)
Assuming we want simmilar behaviour from mean we can say that: Mean m of set S is also known as a 50-percentile-like-mean because:
50% * sumni=0, si < m ( m - si ) = 50% * sumni=0, si > m ( m - si )
Then we can define α-percentile-like-mean to be:
(100% - α) * sumni=0, si < m ( m - si ) = α * sumni=0, si > m ( m - si )
My questions are:
- Does percentile-like-mean have a real mathematical name ?
- Can this equation be solved (find m for some given set S) in a linear time (like mean) ?
- If not can this be solved using something faster than binary search for m (card(S) ~= 107) ?
I have only managed to come up with binary search: Let a,b = min(S), max(S), let m = a=b/2, naively calculate both sides of the equation and adjust a=m or b=m depending on which side was greater.
Maybe the the binary search can be further optimised ? Like spliting range a-b with α and 1-α instead 50%-50%, or using the difference between both sides of the equation somehow ?
Best regards
CodePudding user response:
You have a mistake in your definition on the percentile-like-mean: one side is always negative, the other is always positive. To fix that you need to negate one of the sides:
(1 - α) Σsi<m (m - si) = α Σsi>=m (si - m)
Then it can be verified that it is equivalent with the arithmetic mean for α = 50%. I also include the si = m in the sum; this doesn't change the definition, but simplifies the math.
The above equation can be rewritten as:
m = ((1 - α) Σsi<m si α Σsi>=m si) / ((1 - α) Σsi<m 1 α Σsi>=m 1)
Assuming that si are sorted and sj <= m < sj 1 for some j, it can be rewritten as:
m = ((1 - α) Σi<j si α Σi>=j si) / ((1 - α) j α (n - j))
What's left is to check if any j solves the equation, which can be done in linear time by maintaining the partial sums on the left and on the right:
double left = 0, right = 0;
for(int i = 0; i < n; i)
right = s[i];
for(int j = 0; j 1 < n; j) {
left = s[j];
right -= s[j];
double m = ((1-a)*left a*right)/((1-a)*(j 1) a*(n-j-1));
if(s[j] <= m && m < s[j 1])
return m;
}
CodePudding user response:
I've modified your definition a bit because of Yakov Galka's observations.
(100 - a) * sum_{s_i < m} (m - s_i) = a * sum_{s_i > m} (s_i - m)
This can be rearranged as follows,...
m((100 - a) * count_{s_i < m} a * count_{s_i > m}) = (a * sum_{s_i > m} s_i) ((100 - a) * sum_{s_i < m} s_i)
If the s_i are sorted in ascending order, then making a guess for count_{s_i < m} makes it easy to calculate m. It makes more sense to me to do an integer binary search on count_{s_i < m} to find a value of m that satisfies the equation. It isn't clear to me at this time that a solution will always exist or that a solution would be unique. I think this approach would be O(nlogn) so should be achievable in a reasonable time even if n approx 10^7.