Partition rows based on percent value of the data range-CodePudding

What will be the select query if I want to rank and partition rows based on percent range of the partitioning column?

For example, Let say I have the below table structure (the column 'Rank' needs to be populated). And I want to rank the rows based on the order of score for the row, but only within the /- 10% of the amount range from the current row's amount. That is, for the first row, the amount is 2.3, and /-10% of 2.3 is: 2.07 - 2.53. So while ranking first row, I should rank based on the score and consider only those rows which has the amount in the range 2.07 - 2.53 (in this case its id's 1,5,11). Based on this logic the percentile rank is populated in the last column, and the rank for first row will be 0.5. Similarly, perform the same steps for each row.

Question is how can I do this with PERCENT_RANK() or RANK() or NTILE() with partition clause as part of a select query? The original table does not have the last column, that is the column that needs to be populated. I need the percentile ranking of the row based on the score within the 10% range of amount.

PRODUCT_ID	Amount	Score	Percent_Rank
1	2.3	45	0.5
2	2.7	30	0
3	2.0	40	0.5
4	2.6	50	1
5	2.2	35	0
6	5.1	25	0
7	4.8	40	1
8	6.1	60	0
9	22.1	70	0.33
10	8.2	20	0
11	2.1	50	1
12	22.2	60	0
13	22.3	80	1
14	22.4	75	0.66

I tried using the PERCENT_RANK() over partition() but its not considering the range. I cannot use range unbounded preceding and following in the partition clause because I need the range to be within 10% of the amount in the current row.

CodePudding user response：

You may try PERCENT_RANK() with a self join as the following:

SELECT PRODUCT_ID, Amount, Score, Percent_Rank
FROM
(
  SELECT A.PRODUCT_ID, A.Amount ,A.Score, B.Amount AS B_Amount,
       PERCENT_RANK() OVER (PARTITION BY A.PRODUCT_ID ORDER BY B.SCORE) Percent_Rank
  FROM table_name A JOIN table_name B
  ON B.Amount BETWEEN A.Amount-A.Amount *0.1 AND A.Amount A.Amount*0.1
) T
WHERE Amount=B_Amount

See a demo.

CodePudding user response：

I think that you can just nest your percent_rank in a subquery once you have calculated the bucket number based on equally spaced scores.

The trickiest part of this example is actually getting the fixed width buckets. It might be simpler if we could use width_bucket() but some databases don't support that, so I had to compute manually (in 'bucketed' inline table).

Here is the example. I used postgres to create the mockup test table, because it has a very nice generate_series(), but the actual example SQL should run on any database.

create table product_scores as (
  select 
    product_id,
    score 
  from 
    generate_series(1,2) product_id,
    generate_series(1,50) score);

This created a table with two product ids and 50 scores for each one.

with ranges as (
  select 
    product_id,
    (max(score)-min(score))*(1 1e-10) range,
    min(score) minscore  from product_scores 
    group by product_id),
 bucketed as ( 
   select 
     ranges.product_id,
     score,
     floor((score-minscore)*10.0/range) bucket 
   from 
     ranges 
   inner 
     join product_scores 
   on 
     ranges.product_id=product_scores.product_id) 
 select 
   product_id,
   score,
   bucket,
   percent_rank() 
     over (partition by product_id,bucket order by score) from bucketed;

No the 1e-10 is not a joke. Unfortunately roundoff error would assign the highest value to a bucket all by itself unless we expand the range by a tiny amount. But once we have a workable range we can then calculate the partition easily enough by checking range.

Then having the partition number you can to the percent_rank() as usual, as shown.