Home > Net >  Select greatest n per group using EXISTS
Select greatest n per group using EXISTS

Time:12-09

I have a RoadInsp table in Oracle 18c. I'll put the data in a CTE for purpose of this post:

with roadinsp (objectid, asset_id, date_) as (
select  1, 1, to_date('2016-04-01','YYYY-MM-DD') from dual union all
select  2, 1, to_date('2019-03-01','YYYY-MM-DD') from dual union all
select  3, 1, to_date('2022-01-01','YYYY-MM-DD') from dual union all
select  4, 2, to_date('2016-04-01','YYYY-MM-DD') from dual union all
select  5, 2, to_date('2021-01-01','YYYY-MM-DD') from dual union all
select  6, 3, to_date('2022-03-01','YYYY-MM-DD') from dual union all
select  7, 3, to_date('2016-04-01','YYYY-MM-DD') from dual union all
select  8, 3, to_date('2018-03-01','YYYY-MM-DD') from dual union all
select  9, 3, to_date('2013-03-01','YYYY-MM-DD') from dual union all
select 10, 3, to_date('2010-06-01','YYYY-MM-DD') from dual
)
select * from roadinsp 
  OBJECTID   ASSET_ID DATE_     
---------- ---------- ----------
         1          1 2016-04-01
         2          1 2019-03-01
         3          1 2022-01-01 --select this row

         4          2 2016-04-01
         5          2 2021-01-01 --select this row

         6          3 2022-03-01 --select this row
         7          3 2016-04-01
         8          3 2018-03-01
         9          3 2013-03-01
        10          3 2010-06-01

I'm using GIS software that only lets me use SQL in a WHERE clause/SQL expression, not a full SELECT query.

I want to select the greatest n per group using the WHERE clause. In other words, for each ASSET_ID, I want to select the row that has the latest date.


As an experiment, I want to make the selection specifically using the EXISTS operator.

The reason being: While this post technically pertains to Oracle (since that's what S.O. community members would have access to), in practice, I want to use the logic in a proprietary database called a file geodatabase. The file geodatabase has very limited SQL support; a small subset of SQL-92 syntax. But it does seem to support EXISTS and subqueries, although not correlated subqueries, group by, joins, or any modern SQL syntax. Very frustrating.

SQL reference for query expressions used in ArcGIS

Subquery support in file geodatabases is limited to the following:

  • Scalar subqueries with comparison operators. A scalar subquery returns a single value, for example:

    GDP2006 > (SELECT MAX(GDP2005) FROM countries)

    For file geodatabases, the set functions AVG, COUNT, MIN, MAX, and SUM can only be used in scalar subqueries.

  • EXISTS predicate, for example:

    EXISTS (SELECT * FROM indep_countries WHERE COUNTRY_NAME = 'Mexico')


Question:

Using the EXISTS operator, is there a way to select the greatest n per group? (keeping in mind the limitations mentioned above)

CodePudding user response:

rank analytic function does the job, if it is available to you (Oracle 18c certainly does support it).

Sample data:

SQL> with roadinsp (objectid, asset_id, date_) as (
  2  select  1, 1, to_date('2016-04-01','YYYY-MM-DD') from dual union all
  3  select  2, 1, to_date('2019-03-01','YYYY-MM-DD') from dual union all
  4  select  3, 1, to_date('2022-01-01','YYYY-MM-DD') from dual union all
  5  select  4, 2, to_date('2016-04-01','YYYY-MM-DD') from dual union all
  6  select  5, 2, to_date('2021-01-01','YYYY-MM-DD') from dual union all
  7  select  6, 3, to_date('2022-03-01','YYYY-MM-DD') from dual union all
  8  select  7, 3, to_date('2016-04-01','YYYY-MM-DD') from dual union all
  9  select  8, 3, to_date('2018-03-01','YYYY-MM-DD') from dual union all
 10  select  9, 3, to_date('2013-03-01','YYYY-MM-DD') from dual union all
 11  select 10, 3, to_date('2010-06-01','YYYY-MM-DD') from dual
 12  ),

Query begins here: first rank rows per asset_id by date in descending order:

 13  temp as
 14    (select r.*,
 15       rank() over (partition by asset_id order by date_ desc) rnk
 16     from roadinsp r
 17    )

Finally, fetch rows that rank as the highest:

 18  select *
 19  from temp
 20  where rnk = 1;

  OBJECTID   ASSET_ID DATE_             RNK
---------- ---------- ---------- ----------
         3          1 2022-01-01          1
         5          2 2021-01-01          1
         6          3 2022-03-01          1

SQL>

If you can't use that, how about a subquery?

<snip>
 13  select r.objectid, r.asset_id, r.date_
 14  from roadinsp r
 15  where (r.asset_id, r.date_) in (select t.asset_id, t.max_date
 16                                  from (select a.asset_id, max(a.date_) max_date
 17                                        from roadinsp a
 18                                        group by a.asset_id
 19                                       ) t
 20                                 );

  OBJECTID   ASSET_ID DATE_
---------- ---------- ----------
         6          3 2022-03-01
         5          2 2021-01-01
         3          1 2022-01-01

SQL>

CodePudding user response:

This should work:

select * 
from roadinsp r
where not exists (select 1
                  from roadinsp pr
                  where r.asset_id=pr.asset_id
                  and r.date_< pr.date_)

We get only the entry for each asset_id if there is no (NOT EXISTS) a later (by date) row; only the last rows of each asset satisfy this.

  • Related