Get distinct values from multiple tables efficiently-CodePudding

I have the following tables in a Postgres database:

Table folders

| id  | name      |
|-----|-----------|
| 1   | folder A  |
| 2   | folder B  |

Table files -- Represents the files in folders (large table)

| id  | folder_id |    
|-----|-----------|
| 1   | 1         |
| 2   | 1         |
| 3   | 2         |

Table metadata_one -- Represents some info relating to files (large table)

| id  | file_id | label |    
|-----|---------|-------|
| 1   | 1       | abc   |
| 2   | 1       | def   |
| 3   | 2       | abc   |

Table metadata_two -- Represents some other info relating to files (large table)

| id  | file_id | label |    
|-----|---------|-------|
| 1   | 1       | abc   |
| 2   | 1       | def   |
| 3   | 2       | abc   |

How can I get a list of distinct label values on the folder level?

Desired result

Only distinct label values, across both metadata tables

| name         | labels    |
|--------------|-----------|
| folder A     | abc,def   |
| folder B     | abc       |

Attempt

Currently this is what I do:

SELECT
    folders.name,
    string_agg(m1.label, ',') AS m1_labels,
    string_agg(m2.label, ',') AS m2_labels
FROM
    folders
JOIN files ON
    files.folder_id = folders.id
JOIN metadata_one m1 ON
    m1.file_id = files.id
JOIN metadata_two m2 ON
    m2.file_id = files.id
GROUP BY
    folder.name

But this gives me the following:

| name         | m1_labels | m2_labels |    
|--------------|-----------|-----------|
| folder A     | abc,def   | abc,def   |
| folder B     | abc       | abc       |

I am looking for an optimised solution, since the files and metadata tables can be very large.

CodePudding user response：

You can generate a UNION of metadata_one and metadata_two in a temporary table and then do your string aggregation like this

WITH metadata_by_folder AS (
    SELECT
        folders.name,
        m1.label AS label
    FROM
        folders
    JOIN files ON
        files.folder_id = folders.id
    JOIN metadata_one m1 ON
        m1.file_id = files.id
    UNION SELECT
        folders.name,
        m2.label AS label
    FROM
        folders
    JOIN files ON
        files.folder_id = folders.id
    JOIN metadata_two m2 ON
        m2.file_id = files.id
)
SELECT
    metadata_by_folder.name,
    string_agg(metadata_by_folder.label, ',') AS labels
FROM
    metadata_by_folder
GROUP BY
    metadata_by_folder.name;