Improve Postgresql queries selecting latest entries in groups-CodePudding

I have recorded events from multiple "users" into a single table, over years. Each user has a uid, a timestamp (when) and some other individual properties (e.g. version and language), basically like this:

CREATE TABLE public.events (
    id integer NOT NULL DEFAULT nextval('events_id_seq'::regclass),
    "when" timestamp without time zone,
    uid character(22) COLLATE pg_catalog."default",
    lang character(5) COLLATE pg_catalog."default",
    osv integer,
    daysrun integer,
)

The table has now about 3 million rows, and about 100k unique uid values in it.

In many cases, I want to query the properties of the latest record for each user (uid), i.e. the one with the highest when value. This often takes seconds or even minutes, and I wonder if there are ways to speed this up, in postgresql.

Here's an example of my attempt at querying the longest daysrun value for each user:

SELECT a.* 
FROM events a 
WHERE a.id IN ( SELECT DISTINCT ON (c.uid) c.id 
                FROM events c 
                WHERE c.uid IN ( SELECT DISTINCT m.uid 
                                 FROM events m 
                                 WHERE m.daysrun > 500 ) 
                ORDER BY c.uid, c.daysrun DESC ) 
ORDER BY a.daysrun DESC
LIMIT 10

This takes 11s (on a shared web server, single cpu). If I remove the WHERE clause, it can take minutes, and can bring the server to its knees.

The above example does not even use when for sorting because the daysrun value is equivalent in this case. The nesting is done with the following reasoning:

(inner) Get every unique uid.
(middle) Get every row (their id) that has the highest daysrun value for each uid (here, I could as well have used when in place of daysrun). (Though, I am not even sure that this will always give me the highest one)
(outer) Sort the results by daysrun.

So, what can be improved about this? Is my query wrong, or can I add better indexes, or other ways, maybe a View that makes this more efficient?

In short, how do I select the rows with the highest when value for each uid, so that I can then use those twos for further queries?

CodePudding user response：

You can reduce the amount of inner selects, now you have 3 queries one inside another. You also sort the inner table just to throw it out. You can use aggregate functions to achieve the same result, for example last_value:

select last_value(col1) over (partition by e.uid order by e.daysrun)
from events e
where e.daysrun > 500

I'm using some column col1. If performance is important, it's better to fetch only needed data.

CodePudding user response：

One option here, which might be feasible depending on how up to date your data must be in the queries, and how often you run this logic, is to create a materialized view that contains just the most recent record per uid, and to periodically refresh that.

You run the queries against that instead of the table.

CodePudding user response：

A subquery can be used directly as a table, not just as the input of a IN list. Using it as a table is what you want to do here to allow you to sort one way before the DISTINCT ON and another way after it.

SELECT a.* 
FROM ( SELECT DISTINCT ON (c.uid) * 
                FROM events c  
                ORDER BY c.uid, c.daysrun DESC ) a
ORDER BY a.daysrun DESC
LIMIT 10

Assigning an alias to the subquery is syntactically required, but the alias doesn't need to be used as only one set of columns is in scope at any one time. I find this easier to read and understand:

SELECT * 
FROM ( SELECT DISTINCT ON (uid) * 
                FROM events  
                ORDER BY uid, daysrun DESC ) a
ORDER BY daysrun DESC
LIMIT 10

If this doesn't solve the problem, then you at least need to include an EXPLAIN (ANALYZE, BUFFERS) of it.