Home > Software engineering >  How to correctly compute cumulative distribution on sample dataset
How to correctly compute cumulative distribution on sample dataset

Time:06-08

Consider the following table and data set.

CREATE TABLE MY_TABLE(
  "Group" TEXT,
  Callees INTEGER,
  Callers INTEGER
  );
  
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 1, 505);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 2, 172);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 3, 33);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 4, 20);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 5, 5);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 6, 5);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 7, 3);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 8, 4);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 9, 3);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 10, 2);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 11, 1);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 13, 1);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 14, 1);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 16, 1);
INSERT INTO MY_TABLE("Group", Callees, Callers) VALUES ('Group1', 22, 2);

I want to create a cumulative distribution that shows cumulative % of total Callers based on the values in the Callees column, for the group. For example, for the value Callees = 1, I would like the first output row:

Group1  1 505 0.6662
Group1  2 172 0.8931
Group1  3  33 0.9366
...
Group1 22   1 1.0000

Where 0.6662 is 505 divided by total number of Callers, 758.Then the following rows should would contain the cumulative percentage as we go down the Callees column values.

I tried:

select "Group", Callers,
cume_dist() over ( order by Callees desc) as CumulativeP
from MY_TABLE
order by "Group", Callees

But it does not give me the output I want. How can I make this happen?

I am using SQLite 3.3x

CodePudding user response:

Solution for your problem:

select "Group", Callees,mt.Callers,
ROUND((sum(callers) over(Partition By "Group" order by Callees) * 1.00) / (sum(callers) over (Partition By "Group" RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)),4) CumulativeP
from MY_TABLE mt
order by "Group", Callees;

Working Example: db fiddle link

Explanation:
Window function contains Window Frame to define the number of rows or the range that should be included by the function for calculation.
For example, to calculate cumulative sum, the sum should be applied from the first row till the current row for each row.
Now to calculate the sum over the whole table or for all rows (Within the defined partition), the range should be from first row to the last row of the table(or the partition if defined) and this can be defined with UNBOUNDED PRECEDING (which tells to take the rows from the first row of the table or partition) and 'UNBOUNDED FOLLOWING' (which tell to take the rows till the last one in the table or partition).

Syntaxes of the Window Frame are:

( GROUPS | RANGE | ROWS ) start

(GROUPS | RANGE | ROWS ) BETWEEN start AND end  

Where start can be:

UNBOUNDED PRECEDING
expr PRECEDING
CURRENT ROW
expr FOLLOWING

and end can be:

expr PRECEDING
CURRENT ROW
UNBOUNDED FOLLOWING
expr FOLLOWING

For more info on Window Function please see the below link:
Official Document

For better understanding about Window Frame, follow below link:
Sqlite window frame

  • Related