JOIN with array of ids returns duplicate root records instead of just one-CodePudding

I'm trying to join several tables and pull out each DISTINCT root record (from table_a), but for some reason I keep getting duplicates. Here is my select query:

Fiddle

select
    ta.id,
    ta.table_a_name as "tableName"
from my_schema.table_a ta
left join my_schema.table_b tb
    on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
    on (tc.table_b_id = tb.id)
left join my_schema.table_d td
    on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);

This returns the following:

[
    {
        "id": 2,
        "tableName": "Root record 2"
    },
    {
        "id": 2,
        "tableName": "Root record 2"
    }
]

But I am only expecting, in this case,

[
    {
        "id": 2,
        "tableName": "Root record 2"
    }
]

What am I doing wrong here?

Here's the fiddle and, just in case, the create and insert statements below:

create schema if not exists my_schema;

create table if not exists my_schema.table_a (
    id serial primary key,
    table_a_name varchar (255) not null
);

create table if not exists my_schema.table_b (
    id serial primary key,
    table_a_id bigint not null references my_schema.table_a (id)
);

create table if not exists my_schema.table_d (
    id serial primary key
);

create table if not exists my_schema.table_c (
    id serial primary key,
    table_b_id bigint not null references my_schema.table_b (id),
    table_d_ids bigint[] not null
);

insert into my_schema.table_a values
    (1, 'Root record 1'),
    (2, 'Root record 2'),
    (3, 'Root record 3');

insert into my_schema.table_b values
    (10, 2),
    (11, 2),
    (12, 3);

insert into my_schema.table_d values
    (100),
    (101),
    (102),
    (103),
    (104);

insert into my_schema.table_c values
    (1000, 10, array[]::int[]),
    (1001, 10, array[100]),
    (1002, 11, array[100, 101]),
    (1003, 12, array[102]),
    (1004, 12, array[103]);

CodePudding user response：

Short answer is use distinct, and this will get the results you want:

select distinct
    ta.id,
    ta.table_a_name as "tableName"
from my_schema.table_a ta
left join my_schema.table_b tb
    on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
    on (tc.table_b_id = tb.id)
left join my_schema.table_d td
    on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);

That said, this doesn't sit well with me because I assume this is not the end of your query.

The root issue is that you have two records from table_b - table_d that match this criteria. If you follow the breadcrumbs back, you will see there really are two matches:

select 
    ta.id,
    ta.table_a_name as "tableName", tb.*, tc.*, td.*
from my_schema.table_a ta
left join my_schema.table_b tb
    on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
    on (tc.table_b_id = tb.id)
left join my_schema.table_d td
    on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);

So 'distinct' is just a lazy fix to say if there are dupes, limit it to one...

My next question is, is there more to it than this? What's supposed to happen next? Do you really just want candidates from table_a, or is this part 1 of a longer issue? If there is more to it, then there is likely a better solution than a simple select distinct.

-- edit 10/1/2022 --

Based on your comment, I have one final suggestion. Because this really all there is to your output AND you don't actually need the data from the b/c/d tables, then I think a semi-join is a better solution.

It's slightly more code (not going to win any golf or de-obfuscation contents), but it's much more efficient than a distinct or group by all columns. The reason is a distinct pulls every row result and then has to order and remove dupes. A semi-join, by contrast, will "stop looking" once it finds a match. It also scales very well. Almost every time I see a distinct misused, it's better served by a semi-join.

select
    ta.id,
    ta.table_a_name as "tableName"
from my_schema.table_a ta
where exists (
  select null
  from
    table_b tb,
    table_c tc,
    table_d tc
   where
    tb.table_a_id = ta.id and
    tc.table_b_id = tb.id and
    td.id = any(tc.table_d_ids) and
    td.id = any(array[100])
)

I didn't suggest this initially because I was unclear on the "what next."