PG SQL UNION With ORDER BY, LIMIT Performance Optimization-CodePudding

I am trying to execute a query with an ORDER BY clause and a LIMIT clause for performance. Consider the following schema.

ONE
(id, name)
(1 , a)
(2 , b)
(5 , c)

TWO
(id, name)
(3 , d)
(4 , e)
(5 , f)

I want to be able to get a list of people from tables one and two ordered by ID.

The current query I have is as follows.

WITH combined AS (
(SELECT * FROM one ORDER BY id DESC)
UNION ALL
(SELECT * FROM two ORDER BY id DESC)
)
SELECT * FROM combined ORDER BY id LIMIT 5

the output of the table will be

(id, name)
(1 , a)
(2 , b)
(3 , d)
(4 , e)
(5 , c)

You'll notice that last row "c" or "f" will change based on the order of the UNION (one UNION two versus two UNION one). That's not important as I only care about the order for ID.

Unfortunately, this query does a full scan of both tables as per the ORDER BY on "combined". My table one and two are both billions of rows.

I am looking for a query that will be able to search both tables simultaneously, if possible. Meaning rather than looking through all of "one" for the entries that I need, it first looks to sort both by ID and then look for the minimum from both tables such that if the ID in one table is lower than the ID in another table, the query will look in the other table until the other table's ID is higher or equal to the first table before looking through the first table again.

The correct order of reading the table, given one UNION two would be a, b, d, e, c/f.

CodePudding user response：

Do you just mean this?

WITH combined AS (
(SELECT * FROM one ORDER BY id LIMIT 5)
UNION ALL
(SELECT * FROM two ORDER BY id LIMIT 5)
)
SELECT * FROM combined ORDER BY id LIMIT 5

That will select the 5 "lowest id" rows from each table (which is the minimum you need to guarantee 5 output rows) and then find the lowest of those.

CodePudding user response：

Thanks to a_horse_with_no_name's comment on Richard Huxton's answer regarding adding an index, the query runs considerably faster, from indeterminate to under one minute.

In my case, the query was still too slow, and I came across the following solution.

Consider using results from one table to limit results from another table. The following solution, in combination with indexing by id, worked for my tables with billions of rows, but operates on the assumption that table "one" is faster than table "two" to finish the query.

WITH first as (SELECT * FROM one ORDER BY id LIMIT 5),
     filter as (SELECT min(id) FROM first),
     second as (SELECT * FROM two 
                WHERE id < (SELECT filter.id FROM filter) 
                ORDER BY id LIMIT 5)

combined AS (
(SELECT * FROM first ORDER BY id LIMIT 5)
UNION ALL
(SELECT * FROM second ORDER BY id LIMIT 5)
)
SELECT * FROM combined ORDER BY id LIMIT 5

By using the minimum ID from the first complete query, I can limit the scope that the database scans for completion of the second query.