I already asked this question here, but it contained less information, so I am creating a new question with more details.
Here is a sample of the table I have. Each row is a value a user filled in at a given time, so the timestamp column is never null anywhere in the table. A value under item may be unrecorded if the user didn't fill it in. The id column is auto-generated for each record.
CREATE TABLE tbl (id int, customer_id text, item text, value text, timestamp timestamp);
INSERT INTO tbl VALUES
(1, '001', 'price', '1000', '2021-11-01 01:00:00'),
(2, '001', 'price', '1500', '2021-11-02 01:00:00'),
(3, '001', 'price', '1400', '2021-11-03 01:00:00'),
(4, '001', 'condition', 'good', '2021-11-01 01:00:00'),
(5, '001', 'condition', 'good', '2021-11-02 01:00:00'),
(6, '001', 'condition', 'ok', '2021-11-03 01:00:00'),
(7, '001', 'feeling', 'sad', '2021-11-01 01:00:00'),
(8, '001', 'feeling', 'angry', '2021-11-02 01:00:00'),
(9, '001', 'feeling', 'fine', '2021-11-03 01:00:00'),
(10, '002', 'price', '1200', '2021-11-01 01:00:00'),
(11, '002', 'price', '1600', '2021-11-02 01:00:00'),
(12, '002', 'price', '2000', '2021-11-03 01:00:00'),
(13, '002', 'weather', 'sunny', '2021-11-01 01:00:00'),
(14, '002', 'weather', 'rain', '2021-11-02 01:00:00'),
(15, '002', 'price', '1900', '2021-11-04 01:00:00'),
(16, '002', 'feeling', 'sad', '2021-11-01 01:00:00'),
(17, '002', 'feeling', 'angry', '2021-11-02 01:00:00'),
(18, '002', 'feeling', 'fine', '2021-11-03 01:00:00'),
(19, '003', 'price', '1000', '2021-11-01 01:00:00'),
(20, '003', 'price', '1500', '2021-11-02 01:00:00'),
(21, '003', 'price', '2000', '2021-11-03 01:00:00'),
(22, '003', 'condition', 'ok', '2021-11-01 01:00:00'),
(23, '003', 'weather', 'rain', '2021-11-02 01:00:00'),
(24, '003', 'condition', 'bad', '2021-11-03 01:00:00'),
(25, '003', 'feeling', 'fine', '2021-11-01 01:00:00'),
(26, '003', 'weather', 'sunny', '2021-11-03 01:00:00'),
(27, '003', 'feeling', 'sad', '2021-11-03 01:00:00')
;
For readability, I ordered the table above by id and timestamp; the ordering itself doesn't matter.
- We are using Postgresql Version: PostgreSQL 9.5.19
- The actual table contains over 4 million rows
- The item column contains over 500 distinct items, but I will use at most 10 items per query. The table above uses only 4 of them.
- We also have another table, Customer_table, with unique customer_id values and the customers' general information.
From the above table, I want to query the most recently updated value per item for each customer, as below. Since I use at most 10 items per query, there will be at most 10 item columns.
customer_id | price | condition | feeling | weather | ... (other columns from item)
---|---|---|---|---|
002 | 1900 | null | fine | rain
001 | 1400 | ok | fine | null
003 | 2000 | bad | sad | sunny
This is the query I got from my previous question, but I asked it for only two items.
SELECT customer_id, p.value AS price, c.value AS condition
FROM (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'condition'
ORDER BY customer_id, timestamp DESC
) c
FULL JOIN (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'price'
ORDER BY customer_id, timestamp DESC
) p USING (customer_id)
So, if there is a better solution, please help me. Thank you.
CodePudding user response:
You may try other approaches using row_number()
to identify each customer's most recent row per item: partition by customer_id and item, order by timestamp descending, and keep only the rows with rn=1. You can then aggregate by customer_id, taking MAX() of a CASE expression (or a FILTER clause) that picks out each item's value.
These approaches are less verbose and, based on the online results, appear at least as performant. Let me know in the comments how they perform when replicated in your environment.
You may use EXPLAIN ANALYZE
to compare these approaches to your current one. The results in the online environment were:
Current Approach
| Planning time: 0.129 ms
| Execution time: 0.056 ms
Suggested Approach 1
| Planning time: 0.061 ms
| Execution time: 0.070 ms
Suggested Approach 2
| Planning time: 0.047 ms
| Execution time: 0.056 ms
NB: You should run EXPLAIN ANALYZE
in your own environment, which we cannot replicate online; the results also vary from run to run. Indexes on the item
column and filtering on the desired items as early as possible are also recommended to improve performance.
Schema (PostgreSQL v9.5)
Suggested Approach 1
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as condition,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='feeling' THEN t1.value END) as feeling,
MAX(CASE WHEN t1.item='weather' THEN t1.value END) as weather
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
-- ensure that you filter based on your desired items
-- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
1;
customer_id | condition | price | feeling | weather |
---|---|---|---|---|
001 | ok | 1400 | fine | |
002 | | 1900 | fine | rain |
003 | bad | 2000 | sad | sunny |
Suggested Approach 2
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='condition') as condition,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='feeling') as feeling,
MAX(t1.value) FILTER (WHERE t1.item='weather') as weather
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
-- ensure that you filter based on your desired items
-- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
1;
customer_id | condition | price | feeling | weather |
---|---|---|---|---|
001 | ok | 1400 | fine | |
002 | | 1900 | fine | rain |
003 | bad | 2000 | sad | sunny |
Current Approach with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT customer_id, p.value AS price, c.value AS condition
FROM (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'condition'
ORDER BY customer_id, timestamp DESC
) c
FULL JOIN (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'price'
ORDER BY customer_id, timestamp DESC
) p USING (customer_id);
QUERY PLAN |
---|
Merge Full Join (cost=35.05..35.12 rows=1 width=128) (actual time=0.025..0.030 rows=3 loops=1) |
Merge Cond: (tbl.customer_id = tbl_1.customer_id) |
Buffers: shared hit=2 |
-> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.013..0.014 rows=2 loops=1) |
Buffers: shared hit=1 |
-> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.013..0.013 rows=5 loops=1) |
Sort Key: tbl.customer_id, tbl."timestamp" DESC |
Sort Method: quicksort Memory: 25kB |
Buffers: shared hit=1 |
-> Seq Scan on tbl (cost=0.00..17.50 rows=3 width=72) (actual time=0.004..0.006 rows=5 loops=1) |
Filter: (item = 'condition'::text) |
Rows Removed by Filter: 22 |
Buffers: shared hit=1 |
-> Materialize (cost=17.52..17.55 rows=1 width=64) (actual time=0.010..0.013 rows=3 loops=1) |
Buffers: shared hit=1 |
-> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.010..0.012 rows=3 loops=1) |
Buffers: shared hit=1 |
-> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.010..0.010 rows=10 loops=1) |
Sort Key: tbl_1.customer_id, tbl_1."timestamp" DESC |
Sort Method: quicksort Memory: 25kB |
Buffers: shared hit=1 |
-> Seq Scan on tbl tbl_1 (cost=0.00..17.50 rows=3 width=72) (actual time=0.001..0.003 rows=10 loops=1) |
Filter: (item = 'price'::text) |
Rows Removed by Filter: 17 |
Buffers: shared hit=1 |
Planning time: 0.129 ms |
Execution time: 0.056 ms |
Suggested Approach 1 with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as condition
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN |
---|
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.039..0.047 rows=3 loops=1) |
Group Key: t1.customer_id |
Buffers: shared hit=1 |
-> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.030..0.040 rows=5 loops=1) |
Filter: (t1.rn = 1) |
Rows Removed by Filter: 10 |
Buffers: shared hit=1 |
-> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.029..0.038 rows=15 loops=1) |
Buffers: shared hit=1 |
-> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.028..0.030 rows=15 loops=1) |
Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC |
Sort Method: quicksort Memory: 26kB |
Buffers: shared hit=1 |
-> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1) |
Filter: (item = ANY ('{price,condition}'::text[])) |
Rows Removed by Filter: 12 |
Buffers: shared hit=1 |
Planning time: 0.061 ms |
Execution time: 0.070 ms |
Suggested Approach 2 with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='condition') as condition
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN |
---|
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.029..0.037 rows=3 loops=1) |
Group Key: t1.customer_id |
Buffers: shared hit=1 |
-> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.021..0.032 rows=5 loops=1) |
Filter: (t1.rn = 1) |
Rows Removed by Filter: 10 |
Buffers: shared hit=1 |
-> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.021..0.030 rows=15 loops=1) |
Buffers: shared hit=1 |
-> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.019..0.021 rows=15 loops=1) |
Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC |
Sort Method: quicksort Memory: 26kB |
Buffers: shared hit=1 |
-> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1) |
Filter: (item = ANY ('{price,condition}'::text[])) |
Rows Removed by Filter: 12 |
Buffers: shared hit=1 |
Planning time: 0.047 ms |
Execution time: 0.056 ms |
View working demo on DB Fiddle
CodePudding user response:
You operate on a big table. You mentioned 4 million rows, obviously growing. While querying for ...
- all customers
- all items
- with few rows per (customer_id, item)
- with narrow rows (small row size)
... ggordon's solutions with row_number() are great. And short, too.
The whole table has to be processed in a sequential scan. Indices won't be used.
But prefer "Approach 2" with the modern aggregate FILTER syntax. It's clearer and faster.
Approach 3: Pivot with crosstab()
crosstab() is typically faster, especially for more than a few items.
SELECT *
FROM crosstab(
$$
SELECT customer_id, item, value
FROM (
SELECT customer_id, item, value
, row_number() OVER (PARTITION BY customer_id, item ORDER BY t.timestamp DESC) AS rn
FROM tbl t
WHERE item = ANY ('{condition,price,feeling,weather}') -- your items here ...
) t1
WHERE rn = 1
ORDER BY customer_id, item
$$
, $$SELECT unnest('{condition,price,feeling,weather}'::text[])$$ -- ... here ...
) AS ct (customer_id text, condition text, price text, feeling text, weather text); -- ... and here ...
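Note: crosstab() is provided by the additional module tablefunc, which must be installed once per database before the query above will run:

```sql
-- tablefunc ships with standard Postgres; this registers crosstab() in the current DB
CREATE EXTENSION IF NOT EXISTS tablefunc;
```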
Approach 4: LATERAL Subqueries
If one or more of the criteria listed at the top do not apply, the above queries fall off quickly in performance.
For starters, only max 10 of "500 distinct items" are involved. That's max ~ 2 % of the big table. That alone should make one of the following queries (much) faster in comparison:
SELECT *
FROM (SELECT customer_id FROM customer) c
LEFT JOIN LATERAL (
SELECT value AS condition
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'condition'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t1 ON true
LEFT JOIN LATERAL (
SELECT value AS price
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'price'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t2 ON true
LEFT JOIN LATERAL (
SELECT value AS feeling
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'feeling'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t3 ON true
-- ... more?
The point is to get a query plan with relatively few index(-only) scans to replace the expensive sequential scan on the big table.
Requires an applicable index, obviously:
CREATE INDEX ON tbl (customer_id, item);
Or better (in Postgres 9.5):
CREATE INDEX ON tbl (customer_id, item, timestamp DESC, value);
In Postgres 11 or later, this would be better yet:
CREATE INDEX ON tbl (customer_id, item, timestamp DESC) INCLUDE (value);
If only a few items are of interest, partial indexes on those items would be even better.
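For example, a sketch with a hypothetical index name, assuming 'price' is one of the hot items:

```sql
-- Covers only the 'price' rows; usable by queries filtering on item = 'price'
CREATE INDEX tbl_price_idx ON tbl (customer_id, timestamp DESC, value)
WHERE item = 'price';
```

One such small partial index per queried item keeps each index tiny compared to a full index over all 500+ items.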
Approach 5: Correlated Subqueries
SELECT c.customer_id
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price' ORDER BY t.timestamp DESC LIMIT 1) AS price
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'feeling' ORDER BY t.timestamp DESC LIMIT 1) AS feeling
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'weather' ORDER BY t.timestamp DESC LIMIT 1) AS weather
FROM customer c;
Not as versatile as LATERAL, but good enough for the purpose. Same index requirements as approach 4.
Approach 5 will be king of performance in most cases.
db<>fiddle here
Improving your relational design and/or upgrading to a current version of Postgres would go a long way, too.
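For illustration only, here is what such a redesign might look like, assuming the items you actually query are stable enough to become typed columns (table and column names are hypothetical):

```sql
-- One narrow row per customer per observation time, instead of one row per item
CREATE TABLE customer_status (
  customer_id text      NOT NULL,  -- FK to your customer table
  recorded_at timestamp NOT NULL,
  price       numeric,             -- proper types instead of text
  condition   text,
  feeling     text,
  weather     text,
  PRIMARY KEY (customer_id, recorded_at)
);
```

The "latest per customer" query then becomes a plain DISTINCT ON (customer_id) ... ORDER BY customer_id, recorded_at DESC over this table, with no pivot at all.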