For example, we have this table:
id | value  | relation_id
---|--------|------------
1  | value1 | 1
2  | value2 | 2
3  | value3 | 1
4  | value4 | 1
5  | value5 | 3
And I want to get the rows with ids 1, 2, and 5, for example (one row per distinct relation_id). That is easy, but not when the table has several billion rows; it is quite slow even on an SSD. I know every possible relation_id value in advance, so I tried a query like this:
(select value, relation_id from table where relation_id=2 limit 1)
union
(select value, relation_id from table where relation_id=3 limit 1)
-- and so on
But it turns out that PostgreSQL scans the table from the beginning for each subquery, even though it is a single query. Is there a way to write the query so that PostgreSQL makes a single pass over the table and collects the required rows along the way?
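The single pass being asked for can be sketched in plain Python (illustrative only; the in-memory row list stands in for a sequential table scan):

```python
# Single-pass idea: scan the rows once, keeping the first row seen
# for each relation_id. A sketch of the desired behavior, not of
# PostgreSQL internals.
rows = [
    (1, "value1", 1),
    (2, "value2", 2),
    (3, "value3", 1),
    (4, "value4", 1),
    (5, "value5", 3),
]

first_per_relation = {}
for id_, value, relation_id in rows:
    # setdefault stores the row only if relation_id is not seen yet
    first_per_relation.setdefault(relation_id, (id_, value, relation_id))

result = sorted(first_per_relation.values())
print(result)  # [(1, 'value1', 1), (2, 'value2', 2), (5, 'value5', 3)]
```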
CodePudding user response:
Approach 1: Use the PostgreSQL DISTINCT ON clause. Given an ORDER BY clause, it keeps only the first row for each distinct value of the expression(s) inside the parentheses, preventing duplicates.
SELECT DISTINCT ON(relation_id) id_, value_, relation_id
FROM tab
ORDER BY relation_id
Approach 2: Use the ROW_NUMBER window function to rank the records within each "relation_id" partition, then select the first record of each partition (rn = 1):
WITH cte AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY relation_id ORDER BY id_) AS rn
FROM tab
)
SELECT id_, value_, relation_id
FROM cte
WHERE rn = 1
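Approach 2 can be tried without a PostgreSQL server: the stdlib sqlite3 module also supports ROW_NUMBER (SQLite 3.25+). A small demo, reusing the answer's table and column names:

```python
# Demo of the ROW_NUMBER approach against an in-memory SQLite table
# populated with the example data from the question.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id_ INTEGER, value_ TEXT, relation_id INTEGER)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?)",
    [(1, "value1", 1), (2, "value2", 2), (3, "value3", 1),
     (4, "value4", 1), (5, "value5", 3)],
)

rows = conn.execute("""
    WITH cte AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY relation_id ORDER BY id_) AS rn
        FROM tab
    )
    SELECT id_, value_, relation_id
    FROM cte
    WHERE rn = 1
    ORDER BY relation_id
""").fetchall()

print(rows)  # one row per relation_id: ids 1, 2, 5
```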
Approach 3: Use ROW_NUMBER inside the filtering construct FETCH FIRST ... ROWS WITH TIES. This behaves like Approach 2 but saves you the subquery: it returns the first row plus all rows tied with it on rownum = 1.
SELECT *
FROM tab
ORDER BY ROW_NUMBER() OVER(PARTITION BY relation_id ORDER BY id_) = 1 DESC
FETCH FIRST 1 ROWS WITH TIES
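The WITH TIES semantics can also be checked outside PostgreSQL. A sketch in Python (using the same assumed example rows) that emulates the boolean sort key and the tie handling:

```python
# Emulate: ORDER BY (rn = 1) DESC FETCH FIRST 1 ROWS WITH TIES.
# Rows whose row number within their relation_id partition is 1 all
# share the top sort key, so WITH TIES returns every one of them.
from itertools import groupby

rows = [(1, "value1", 1), (2, "value2", 2), (3, "value3", 1),
        (4, "value4", 1), (5, "value5", 3)]

# Compute rn = ROW_NUMBER() OVER (PARTITION BY relation_id ORDER BY id_)
numbered = []
for _, group in groupby(sorted(rows, key=lambda r: (r[2], r[0])),
                        key=lambda r: r[2]):
    for rn, row in enumerate(group, start=1):
        numbered.append((row, rn))

# Sort by the boolean key (rn == 1) descending, then keep the first
# row plus every row tied with it on that key.
numbered.sort(key=lambda x: x[1] == 1, reverse=True)
top_key = numbered[0][1] == 1
tied = [row for row, rn in numbered if (rn == 1) == top_key]

print(sorted(tied))  # same rows as Approach 2
```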