Home > front end >  Using a WHERE clause subquery, select the greatest n per group with a tie-breaker
Using a WHERE clause subquery, select the greatest n per group with a tie-breaker

Time:12-11

I have a RoadInsp table in a SQLite 3.39 database. I've put the data in a CTE for purpose of this question:

with roadinsp (objectid, asset_id, date_, condition) as (
values
(1, 1, '2016-04-01', 20),
(2, 1, '2019-03-01', 19),
(3, 1, '2022-01-01', 18),
  
(4, 2, '2016-04-01', 17),
(5, 2, '2022-01-01', 16),
  
(6, 3, '2022-03-01', 15),  --duplicate date
(7, 3, '2022-03-01', 14),  --duplicate date
(8, 3, '2019-01-01', 13)
)  
select * from roadinsp

objectid  asset_id      date_   condition
       1         1  2016-04-01         20
       2         1  2019-03-01         19
       3         1  2022-01-01         18

       4         2  2016-04-01         17
       5         2  2022-01-01         16

       6         3  2022-03-01         15
       7         3  2022-03-01         14
       8         3  2019-01-01         13

db<>fiddle demo


I'm using GIS software that only lets me use SQL in a WHERE clause/SQL expression, not a full SELECT query.

I want to select the greatest n per group using a WHERE clause. In other words, for each ASSET_ID, I want to select the row that has the latest date.

I can achieve that using a WHERE clause expression like this:

date_ = 
  (select max(subq.date_) from roadinsp subq where roadinsp.asset_id = subq.asset_id)

objectid  asset_id      date_   condition
       3         1  2022-01-01         18
       5         2  2022-01-01         16
       6         3  2022-03-01         15
       7         3  2022-03-01         14

That works, but it selects two rows for asset #3, since there are two rows with the same date for that asset.

So I want to break the tie by selecting the row that has the highest condition value. It would look like this:

objectid  asset_id  date_       condition
       3         1  2022-01-01         18
       5         2  2022-01-01         16
       6         3  2022-03-01         15  --this row has a higher condition value than the other duplicate row.
                                           --so the other duplicate row was omitted.

I only ever want to select one row per asset. So if there are duplicate condition values too, then it doesn't matter which one is selected, as long as only one is selected.


Using a WHERE clause subquery, how can I select the greatest n per group, and break ties with the row that has the largest condition?

CodePudding user response:

Perhaps you could add another similar condition to the where clause (as I understand your question, that's all you can modify?):

select * 
from roadinsp r
where 
  date_ = (
    select max(subq.date_) 
    from roadinsp subq 
    where r.asset_id = subq.asset_id
 )
 and condition = (
   select Max(condition) 
   from roadinsp c 
   where c.date_ = r.date_ 
   and c.asset_id = r.asset_id
);

Updated Fiddle

CodePudding user response:

You can avoid the MAX aggregation, and making it more efficient, by using the ROW_NUMBER window function. It will generate an increasing ranking value, given a partitioning and an order inside the partition you identify. In your case you need to:

  • partition on the "asset_id" field
  • order on the "date_" and "condition" fields, descendently for both

Once you have that, it is enough to filter out all rows that have ranking higher than 1 (the rows you want to exclude, that have lower "date_" and "condition" values inside each "asset_id" partition).

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY asset_id ORDER BY date_ DESC, condition DESC) AS rn 
    from roadinsp  
)
SELECT * 
FROM cte
WHERE rn = 1

Check the demo here.

  • Related