Select multiple 'latest' rows from single table of snapshots, one per story snapshotted-CodePudding

I have a table with multiple snapshots of analytics data for multiple stories. The data is stored along with the timestamp it was taken, and the story_id the data is referring to.

id: integer auto_increment
story_id: string
timestamp: datetime
value: number

and I need to pull out the latest value for each story (i.e. each unique storyId) in a list of ids.

I've written a query, but it scales catastrophically.

SELECT story_id, value
FROM table
WHERE story_id IN ('1','2','3')
AND id = (SELECT id
          FROM table inner
          WHERE inner.story_id = table.story_id
          ORDER BY timestamp DESC
          LIMIT 1)

What's a more efficient way to make this query?

Nice to know:

story_id has to be a string, it's from an external data source
story_id and timestamp already have indexes
there are 2.9M rows and counting...

CodePudding user response：

This is a good case for order by - controlled distinct on.

select DISTINCT ON (story_id) 
       story_id, "value"
from the_table
where story_id in ('1','2','3')
ORDER BY story_id, "timestamp" desc;

An index on story_id, timestamp as @wildplasser suggests will make it scale well.

CodePudding user response：

You can try it like this, does it improve your performance?

SELECT 
  story_id, 
  value
FROM (
      SELECT 
        story_id, 
        timestamp,
        MAX(timestamp) OVER (PARTITION BY story_id) AS max_timestamp_per_id,
        value
      FROM table)
WHERE timestamp = max_timestamp_per_id

CodePudding user response：

A equivalent rewrite of your query using analytic function to get the last ID per story_id in the timestamp order is as below

with last_snap as (
select
story_id, value,
row_number() over (partition by story_id order by timestamp desc)  as rn
from tab
)
select story_id, value
from last_snap
where rn = 1;

This may work bettwer than you subquery solution, but will also not scale for data larger that few 100K rows (better, it will scale the same way as sorting of your full data).

The correct setup for a table with multiple snapshots is a partitioned table containing one partition for each snapshot.

The query selects only the data in the last snapshot (snap_id = nnn) skipping all older data.

CodePudding user response：

Use the proper table definition, including natural key on (story_id, ztimestamp) .
BTW timestamp is a data type, better not use it as a column name.
BTW2: you probably want story_id to be an integer field in stead of a text field, and since it is a key field you may also want it to be NOT NULL.

-- DDL
DROP TABLE story CASCADE;
CREATE TABLE story
        ( id serial not null primary key
        , story_id text NOT NULL
        , ztimestamp timestamp not null
        , zvalue integer not null default 0
        , UNIQUE (story_id, ztimestamp) -- the natural key
        );


\d  story


EXPLAIN
SELECT * FROM story st
WHERE story_id IN('1','2','3')
AND NOT EXISTS(
        SELECT *
        FROM story nx
        WHERE nx.story_id = st.story_id
        AND nx.ztimestamp > st.ztimestamp
        );

DROP TABLE
CREATE TABLE
                                                              Table "tmp.story"
   Column   |            Type             | Collation | Nullable |              Default              | Storage  | Stats target | Description 
------------ ----------------------------- ----------- ---------- ----------------------------------- ---------- -------------- -------------
 id         | integer                     |           | not null | nextval('story_id_seq'::regclass) | plain    |              | 
 story_id   | text                        |           | not null |                                   | extended |              | 
 ztimestamp | timestamp without time zone |           | not null |                                   | plain    |              | 
 zvalue     | integer                     |           | not null | 0                                 | plain    |              | 
Indexes:
    "story_pkey" PRIMARY KEY, btree (id)
    "story_story_id_ztimestamp_key" UNIQUE CONSTRAINT, btree (story_id, ztimestamp)

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=1.83..18.97 rows=13 width=48)
   ->  Bitmap Heap Scan on story st  (cost=1.67..10.94 rows=16 width=48)
         Recheck Cond: (story_id = ANY ('{1,2,3}'::text[]))
         ->  Bitmap Index Scan on story_story_id_ztimestamp_key  (cost=0.00..1.67 rows=16 width=0)
               Index Cond: (story_id = ANY ('{1,2,3}'::text[]))
   ->  Index Only Scan using story_story_id_ztimestamp_key on story nx  (cost=0.15..0.95 rows=2 width=40)
         Index Cond: ((story_id = st.story_id) AND (ztimestamp > st.ztimestamp))
(7 rows)