I have a table with multiple snapshots of analytics data for multiple stories. The data is stored along with the timestamp it was taken, and the story_id the data is referring to.
id: integer auto_increment
story_id: string
timestamp: datetime
value: number
and I need to pull out the latest value
for each story (i.e. each unique storyId) in a list of ids.
I've written a query, but it scales catastrophically.
SELECT story_id, value
FROM table
WHERE story_id IN ('1','2','3')
AND id = (SELECT id
FROM table inner
WHERE inner.story_id = table.story_id
ORDER BY timestamp DESC
LIMIT 1)
What's a more efficient way to make this query?
Nice to know:
- story_id has to be a string, it's from an external data source
- story_id and timestamp already have indexes
- there are 2.9M rows and counting...
CodePudding user response:
This is a good case for order by
- controlled distinct on.
select DISTINCT ON (story_id)
story_id, "value"
from the_table
where story_id in ('1','2','3')
ORDER BY story_id, "timestamp" desc;
An index on story_id, timestamp
as @wildplasser suggests will make it scale well.
CodePudding user response:
You can try it like this, does it improve your performance?
SELECT
story_id,
value
FROM (
SELECT
story_id,
timestamp,
MAX(timestamp) OVER (PARTITION BY story_id) AS max_timestamp_per_id,
value
FROM table)
WHERE timestamp = max_timestamp_per_id
CodePudding user response:
A equivalent rewrite of your query using analytic function to get the last ID
per story_id
in the timestamp
order is as below
with last_snap as (
select
story_id, value,
row_number() over (partition by story_id order by timestamp desc) as rn
from tab
)
select story_id, value
from last_snap
where rn = 1;
This may work bettwer than you subquery solution, but will also not scale for data larger that few 100K rows (better, it will scale the same way as sorting of your full data).
The correct setup for a table with multiple snapshots is a partitioned table containing one partition for each snapshot.
The query selects only the data in the last snapshot (snap_id = nnn
) skipping all older data.
CodePudding user response:
Use the proper table definition, including natural key on
(story_id, ztimestamp)
.BTW
timestamp
is a data type, better not use it as a column name.BTW2: you probably want
story_id
to be an integer field in stead of a text field, and since it is a key field you may also want it to be NOT NULL.
-- DDL
DROP TABLE story CASCADE;
CREATE TABLE story
( id serial not null primary key
, story_id text NOT NULL
, ztimestamp timestamp not null
, zvalue integer not null default 0
, UNIQUE (story_id, ztimestamp) -- the natural key
);
\d story
EXPLAIN
SELECT * FROM story st
WHERE story_id IN('1','2','3')
AND NOT EXISTS(
SELECT *
FROM story nx
WHERE nx.story_id = st.story_id
AND nx.ztimestamp > st.ztimestamp
);
DROP TABLE
CREATE TABLE
Table "tmp.story"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
------------ ----------------------------- ----------- ---------- ----------------------------------- ---------- -------------- -------------
id | integer | | not null | nextval('story_id_seq'::regclass) | plain | |
story_id | text | | not null | | extended | |
ztimestamp | timestamp without time zone | | not null | | plain | |
zvalue | integer | | not null | 0 | plain | |
Indexes:
"story_pkey" PRIMARY KEY, btree (id)
"story_story_id_ztimestamp_key" UNIQUE CONSTRAINT, btree (story_id, ztimestamp)
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Nested Loop Anti Join (cost=1.83..18.97 rows=13 width=48)
-> Bitmap Heap Scan on story st (cost=1.67..10.94 rows=16 width=48)
Recheck Cond: (story_id = ANY ('{1,2,3}'::text[]))
-> Bitmap Index Scan on story_story_id_ztimestamp_key (cost=0.00..1.67 rows=16 width=0)
Index Cond: (story_id = ANY ('{1,2,3}'::text[]))
-> Index Only Scan using story_story_id_ztimestamp_key on story nx (cost=0.15..0.95 rows=2 width=40)
Index Cond: ((story_id = st.story_id) AND (ztimestamp > st.ztimestamp))
(7 rows)