I have series of numbers that add up to nearly 100 (they are percentages), for example:
A | B | C |
---|---|---|
49.99 | 9.99 | 40.01 |
And I would like to adjust these to get something like:
A | B | C |
---|---|---|
50.00 | 10.00 | 40.00 |
Constraints :
- The numbers count may vary,
- The initial sum can be a little over/under 100 (I use a threshold of 0.1 : if the sum is below 99.9 or over 100.1 then I don't adjust),
- sometimes there can be no "nice/round" numbers. In that case I want to distribute the missing/exceeding 1/100s in favor of the lowest numbers like so:
A | B | C | D | E | F | G |
---|---|---|---|---|---|---|
33.33 | 16.66 | 8.33 | 8.33 | 9.99 | 12.50 | 10.83 |
33.33 | 16.66 | 8.34 | 8.34 | 10.00 | 12.50 | 10.83 |
The function I currently use is working great for distributing the missing/exceeding 1.100s in a loop but does not address the 49.99 → 50.00 problem.
The principle it uses is :
- multiply all the % by 100 (to get integers)
- assign a "weight" to each integer based on the number of trailing zeroes (so that I adjust preferably the non-round numbers):
- 30 when (int mod 1000)=0
- 20 when (int mod 100)=0
- 10 when (int mod 10)=0
- 5 when (int mod 5) = 0
- else 0.
- calculate the missing/exceeding 1/100s for the lowest weight
- sort the ints to get the lowest ints first (the sort takes the weight into account)
- add/substract 1 to each int until I get a sum of 100.
The resulting function will be written in Postgres' Pl/PgSql but I'm mostly interested in knowing if such algorithms exist and how they are named (of course, a link to a working function would be very much appreciated).
CodePudding user response:
I ended up splitting the problem in 2 :
- adjust the original shares to get "nicer" numbers (49.99 → 50.00) and
- add a few 1/100s here and there to obtain a 100% total.
The result is satisfying, I even integrated a few special values for common fractions (200/3, 100/3, 100/6, 100/12...) so that 3x33.33 don't end up with 33.35, 33.35 and 33.30, less "nice/round" but more fair. Note there is only one loop for the final adjustments. Performance is acceptable : 3.5 sec for 100000 rows.
The following SO question and the included Wikipedia article helped me understand the possible biases and their pros/cons.
Here is the code for those interested :
CREATE OR REPLACE FUNCTION public.normalizeshares_weight(share INT) RETURNS INT AS
$BODY$
SELECT CASE
WHEN share % 10000 = 0 THEN 40
WHEN share % 1000 = 0 THEN 30
WHEN share % 100 = 0 THEN 20
WHEN share % 50 = 0 OR (share = ANY('{6666,3333,1666,833,416}')) THEN 15
WHEN (share % 10 = 0) THEN 10
WHEN share % 5 = 0 THEN 5
ELSE 0 END;
$BODY$ LANGUAGE SQL IMMUTABLE;
CREATE OR REPLACE FUNCTION public.normalizeshares(shares NUMERIC[]) RETURNS NUMERIC(5,2)[] AS
$BODY$
DECLARE
intshares INT[];
adjshares INT[];
weight INT[];
result NUMERIC[];
nb0 INT = 0;
nb5 INT = 0;
nb10 INT = 0;
nb15 INT = 0;
nb20 INT = 0;
nb30 INT = 0;
nb40 INT = 0;
initot INT = 0;
tot INT = 0;
nb INT = 0;
w INT = 0;
diff INT;
each INT;
bestweight INT;
BEGIN
FOR i IN 1..ARRAY_LENGTH(shares,1) LOOP
intshares[i] := FLOOR(COALESCE(shares[i],0)*100);
weight[i] := normalizeshares_weight(intshares[i]);
bestweight := weight[i];
adjshares[i] := intshares[i];
IF normalizeshares_weight(intshares[i] 1) > bestweight THEN adjshares[i] := intshares[i] 1; bestweight := normalizeshares_weight(intshares[i] 1); END IF;
IF normalizeshares_weight(intshares[i] 2) > bestweight THEN adjshares[i] := intshares[i] 2; bestweight := normalizeshares_weight(intshares[i] 2); END IF;
IF normalizeshares_weight(intshares[i] 3) > bestweight THEN adjshares[i] := intshares[i] 2; bestweight := normalizeshares_weight(intshares[i] 3); END IF;
IF normalizeshares_weight(intshares[i]-1) > bestweight THEN adjshares[i] := intshares[i]-1; bestweight := normalizeshares_weight(intshares[i]-1); END IF;
IF normalizeshares_weight(intshares[i]-2) > bestweight THEN adjshares[i] := intshares[i]-2; bestweight := normalizeshares_weight(intshares[i]-2); END IF;
IF normalizeshares_weight(intshares[i]-3) > bestweight THEN adjshares[i] := intshares[i]-2; bestweight := normalizeshares_weight(intshares[i]-3); END IF;
tot := tot adjshares[i];
initot := initot intshares[i];
weight[i] := bestweight; -- normalizeshares_weight(adjshares[i]); already calculated
IF weight[i]=0 THEN nb0 := nb0 1;
ELSIF weight[i]=5 THEN nb5 := nb5 1;
ELSIF weight[i]=10 THEN nb10 := nb10 1;
ELSIF weight[i]=15 THEN nb15 := nb15 1;
ELSIF weight[i]=20 THEN nb20 := nb20 1;
ELSIF weight[i]=30 THEN nb30 := nb30 1;
ELSIF weight[i]=40 THEN nb40 := nb40 1;
END IF;
result[i] := (intshares[i]::NUMERIC / 100)::NUMERIC(5,2);
END LOOP;
IF tot=10000 THEN
-- RAISE NOTICE 'adjtot=100.00 : %',adjshares::numeric[];
FOR i IN 1..ARRAY_LENGTH(shares,1) LOOP
result[i] := (adjshares[i]::NUMERIC / 100)::NUMERIC(5,2);
END LOOP;
ELSIF (initot=10000) OR (ABS(10000-tot)>90) THEN
-- RAISE NOTICE 'No adj needed, initot=%, tot=%',initot,tot;
ELSE
IF nb0 > 0 THEN nb := nb0; w := 0;
ELSIF nb5 > 0 THEN nb := nb5; w := 5;
ELSIF nb10 > 0 THEN nb := nb10; w := 10;
ELSIF nb15 > 0 THEN nb := nb15; w := 15;
ELSIF nb20 > 0 THEN nb := nb20; w := 20;
ELSIF nb30 > 0 THEN nb := nb30; w := 30;
ELSIF nb40 > 0 THEN nb := nb40; w := 40;
END IF;
diff := 10000 - tot;
each := diff/nb diff/abs(diff);
-- RAISE NOTICE 'nb=%, w=%, diff=%, tot=%, adj=%',nb,w,diff,tot,adjshares::numeric[];
FOR i IN 1..ARRAY_LENGTH(shares,1) LOOP
IF weight[i]=w THEN
IF diff=0 THEN
ELSIF nb=1 THEN
adjshares[i] := adjshares[i] diff;
ELSIF nb>1 THEN
adjshares[i] := adjshares[i] each;
diff := diff - each;
END IF;
nb := nb -1;
END IF;
result[i] := (adjshares[i]::NUMERIC / 100)::NUMERIC(5,2);
END LOOP;
END IF;
RETURN result;
END;
$BODY$ LANGUAGE plpgsql IMMUTABLE;
And a few results :
% select normalizeshares('{49.99,9.99,40.01}');
normalizeshares
---------------------
{50.00,10.00,40.00}
% select normalizeshares('{33.33,16.66,8.33,8.33,9.99,12.5,10.83}');
normalizeshares
-------------------------------------------
{33.33,16.66,8.33,8.33,10.00,12.50,10.85}