Postgres Distinct Order by-CodePudding

I have to tables that I want to join, order by two timestamps and get as result the distinct values (for several columns). But it doesn't work. See examples below:

CREATE TABLE t1(myid int, myyear int, mycol int, mdate timestamp);
INSERT INTO t1 VALUES
(11833,2022,1059,'2022-11-03 22:02:00'),(11834,2022,1059,'2022-11-17 19:56:41'),(11832,2021,1058,'2021-11-16 16:38:21'),(11839,2021,1057,'2021-11-10 18:08:09'),(11847,2021,1055,'2022-05-31 12:13:11'),(11847,2021,1055,'2022-05-31 12:13:11'),(11850,2021,1049,'2021-09-29 16:11:31'),(11853,2021,1046,'2022-01-24 11:44:41'),(11855,2021,1045,'2022-01-24 11:38:05'),(11865,2021,1044,'2022-01-24 11:23:51'),(11856,2021,1043,'2022-01-24 11:00:24'),(11840,2021,1042,'2021-11-30 12:28:13'),(11831,2021,1042,'2021-11-30 12:22:30'),(11846,2022,1042,'2022-11-02 15:06:00'),(11829,2022,1036,'2022-11-02 02:37:00'),(11826,2021,1035,'2021-09-24 13:07:48'),(11825,2021,1034,'2021-10-06 08:22:23'),(11830,2022,1033,'2022-11-03 21:18:00'),(11827,2022,1033,'2022-11-15 21:46:04'),(11828,2022,1032,'2022-11-08 16:44:08'),(11824,2022,1031,'2022-10-25 18:09:03'),(11823,2022,1031,'2022-11-02 03:10:00'),(11822,2022,1030,'2022-10-24 14:59:25')

  ;

    CREATE TABLE t2(myid int, name varchar,idate timestamp);
INSERT INTO t2 VALUES
(11833,'Name1684','2023-01-10 15:52:55'),(11834,'Name1727','2023-01-10 15:52:55'),(11832,'Name609','2023-01-10 15:52:54'),(11839,'Name608','2023-01-10 15:52:59'),(11847,'Name606','2023-01-10 15:53:03'),(11847,'Name607','2023-01-10 15:53:03'),(11850,'Name605','2023-01-10 15:53:04'),(11853,'Name604','2023-01-10 15:53:05'),(11855,'Name603','2023-01-10 15:53:06'),(11865,'Name602','2023-01-10 15:53:10'),(11856,'Name601','2023-01-10 15:53:07'),(11840,'Name600','2023-01-10 15:52:59'),(11831,'Name1726','2023-01-10 15:52:53'),(11846,'Name1683','2023-01-10 15:53:03'),(11829,'Name1682','2023-01-10 15:52:52'),(11826,'Name599','2023-01-10 15:52:50'),(11825,'Name598','2023-01-10 15:52:49'),(11830,'Name1681','2023-01-10 15:52:52'),(11827,'Name1725','2023-01-10 15:52:51'),(11828,'Name1680','2023-01-10 15:52:51'),(11824,'Name1678','2023-01-10 15:52:48'),(11823,'Name1679','2023-01-10 15:52:48'),(11822,'Name1677','2023-01-10 15:52:47')

;

Show example which is not working before order and distinct:

Select 
*
from t1
join t2
on t1.myid=t2.myid where t1.mycol =1059

=> Gives me this result:

myid	myyear	mycol	mdate	myid	name	idate
11833	2022	1059	2022-11-03 22:02:00	11833	Name1684	2023-01-10 15:52:55
11834	2022	1059	2022-11-17 19:56:41	11834	Name1727	2023-01-10 15:52:55

I want to order first by column mdate, then by idate (both to see the youngest dates) and then see only distinct values of (myyear and mycol)

CREATE TABLE expectedresult(myid int, myyear int,mycol int, mdate timestamp,name varchar,idate timestamp);
INSERT INTO expectedresult VALUES
(11834,2022,1059,'2022-11-17 19:56:41','Name1727','2023-01-10 15:52:55')

myid	myyear	mycol	mdate	name	idate
11834	2022	1059	2022-11-17 19:56:41	Name1727	2023-01-10 15:52:55

This is what I have tried:

create table t3 as(
select distinct on (subq1.myyear,subq1.mycol)
  *
  from(
Select 
t1.myid,
t1.myyear,
t1.mycol,
t1.mdate,
t2.name,
t2.idate
from t1
join t2
on t1.myid=t2.myid
order by t1.mdate desc, t2.idate desc) subq1)

But it "distincts" the wrong row(because a younger mdate is available):

select * from t3 where mycol =1059

myid	myyear	mycol	mdate	name	idate
11833	2022	1059	2022-11-03 22:02:00	Name1684	2023-01-10 15:52:55

here also as fiddle: https://dbfiddle.uk/eS5FoBeq

Best

CodePudding user response：

SELECT DISTINCT ON (t1.myyear, t1.mycol)
    *
FROM
    t1
    JOIN t2 ON t1.myid = t2.myid
ORDER BY
    t1.myyear,
    t1.mycol,
    t1.mdate DESC,
    t2.idate DESC;

or rewrite your query as:

SELECT DISTINCT ON (subq1.myyear, subq1.mycol)
    *
FROM (
    SELECT
        t1.myid,
        t1.myyear,
        t1.mycol,
        t1.mdate,
        t2.name,
        t2.idate
    FROM
        t1
        JOIN t2 ON t1.myid = t2.myid
    ORDER BY
        t1.mdate DESC,
        t2.idate DESC) subq1
ORDER BY
    subq1.myyear,
    subq1.mycol,
    subq1.mdate DESC,
    subq1.idate DESC;

if you distinct on (x,y) then you order by should be order by x,y,z x, y is the columns that you want to get the unique row.In a group set (x,y), there are many rows, but you only want one, then you need order by z to get the only one row in a group set (x,y) in a deterministic way, otherwise, it will get a random row in a group set(x,y).

CodePudding user response：

In general I try to avoid using distinct. You can use row number to identify the number of elements with the same "myyear" and "mycol" and order them by newest date and then select the first value (rn =1).

    with cte as(
    Select 
    t1.myid,
    t1.myyear,
    t1.mycol,
    t1.mdate,
    t2.name,
    t2.idate,
    ROW_NUMBER() OVER (PARTITION BY myyear, mycol ORDER BY mdate DESC) as rn
    from t1
    join t2
    on t1.myid=t2.myid
     subq1)
    ) 
    Select * 
    from cte 
where rn = 1