how can i obtain the primary key for a row associated with the minimum value in a subset specified i-CodePudding

i have been working on some database queries and am running into some issues with trying to obtain a primary key for a row where an aggregated value was obtained. i am using postgresql v10.

take for example the following table:

 ---------------------- --------------- ----------- --------------------------------- 
|instructor_schedule_id|instructor_name|course_name|start_date                       |
 ---------------------- --------------- ----------- --------------------------------- 
|1                     |bob            |databases  |2015-01-01 00:00:00.000000  00:00|
|2                     |bob            |databases  |2018-01-01 00:00:00.000000  00:00|
|3                     |bob            |databases  |2024-01-01 00:00:00.000000  00:00|
|4                     |alice          |databases  |2021-01-01 00:00:00.000000  00:00|
|5                     |alice          |databases  |2022-01-01 00:00:00.000000  00:00|
 ---------------------- --------------- ----------- ---------------------------------

what i would like to obtain, is the "current" value for each instructor/course combination, where current is determined by finding the most recent record from the past.

my desired result set would look like this:

 ---------------------- --------------- ----------- --------------------------------- 
|instructor_schedule_id|instructor_name|course_name|start_date                       |
 ---------------------- --------------- ----------- --------------------------------- 
|2                     |bob            |databases  |2018-01-01 00:00:00.000000  00:00|
|5                     |alice          |databases  |2022-01-01 00:00:00.000000  00:00|
 ---------------------- --------------- ----------- ---------------------------------

as you can see, we have the most recent assignment to each instructor for each of their courses (only one course in this example, but there are typically many). bobs 2024 assignment hasn't happened yet, and bobs 2015 assignment has expired since we now have a 2018 assignment. same for alice.

looking at this, what i WANT to do is:

select * from instructor_schedule where instructor_schedule_id in (
    select instructor_schedule_id from (select
        min(start_date),
        course_name,
        instructor_name,
        instructor_schedule_id
    from
        instructor_schedule
    group by
        course_name, instructor_name) inner
);

which clearly isn't going to work due to instructor_schedule_id being present, since we can't select a column without including it in the group by or passing it to an aggregate function.

my question is, how can i get the primary key associated with the row where min(start_date) was obtained?

here is some DDL for reproducing:

drop table if exists instructor_schedule;
create table instructor_schedule(
    instructor_schedule_id bigserial primary key,
    instructor_name varchar(255),
    course_name varchar(255),
    start_date timestamp with time zone
);

alter table instructor_schedule add constraint no_dupes unique(instructor_name, course_name, start_date);

insert into instructor_schedule (instructor_name, course_name, start_date) values
('bob', 'databases', '2015-01-01 00:00:00.000000  00:00'),
('bob', 'databases', '2018-01-01 00:00:00.000000  00:00'),
('bob', 'databases', '2024-01-01 00:00:00.000000  00:00'),
('alice', 'databases', '2021-01-01 00:00:00.000000  00:00'),
('alice', 'databases', '2022-01-01 00:00:00.000000  00:00');

CodePudding user response：

thanks for the precise scripts, this helps.

Here is one way to accomplish what you are looking for. first step is to filter out for all dates prior to now()(ie ignore future dates) ,we can then we can rank them on the basis of the latest entry available, lastly we filter out on the entries where rnk=1 which will pick up those entries that are the latest.

with data
  as (
     select *,row_number() over(partition by instructor_name,course_name order by start_date desc) as rnk
       from instructor_schedule
      where start_date<=now()  --restrict records to entries which have occured in the past
     )
select *
  from data x
 where x.rnk=1 --as we are ranking using the earliest start_date rnk=1, gets you the latest entry record.

In your original query, you can accomplish the results you are looking for as below.

select orig.* 
  from instructor_schedule orig 
  join (select max(start_date) as max_start_date,
                   course_name,
                   instructor_name
              from instructor_schedule as is1
             where start_date<=now()             
          group by course_name, 
                   instructor_name
        ) inner1
    on orig.course_name=inner1.course_name
   and orig.instructor_name=inner1.instructor_name
   and orig.start_date=inner1.max_start_date

proof of example

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=09b536a0495ab8786cc2cd0dbf8077ed

CodePudding user response：

Try this:

with main_tab as (
    select *,
        row_number() over (partition by instructor_name order by start_date desc) rn
    from instructor_schedule
    where start_date <= now()
)
select * from main_tab where rn  = 1;

CodePudding user response：

with data as (
    select *,
        row_number() over (partition by instructor_name order by start_date desc) as rn
    from instructor_schedule
    where start_date <= now()
)
select * from data where rn  = 1;