I already asked this question here, but it contained less information, so I am creating a new question with more details.
Here is a sample of the table I have. Each row is a value a user filled in at a given time, so the timestamp column is never null anywhere in the table. A value under item may be unrecorded if the user didn't fill it in. The id column is auto-generated for each record.
CREATE TABLE tbl (id int, customer_id text, item text, value text, timestamp timestamp);
INSERT INTO tbl VALUES
(1, '001', 'price', '1000', '2021-11-01 01:00:00'),
(2, '001', 'price', '1500', '2021-11-02 01:00:00'),
(3, '001', 'price', '1400', '2021-11-03 01:00:00'),
(4, '001', 'condition', 'good', '2021-11-01 01:00:00'),
(5, '001', 'condition', 'good', '2021-11-02 01:00:00'),
(6, '001', 'condition', 'ok', '2021-11-03 01:00:00'),
(7, '001', 'feeling', 'sad', '2021-11-01 01:00:00'),
(8, '001', 'feeling', 'angry', '2021-11-02 01:00:00'),
(9, '001', 'feeling', 'fine', '2021-11-03 01:00:00'),
(10, '002', 'price', '1200', '2021-11-01 01:00:00'),
(11, '002', 'price', '1600', '2021-11-02 01:00:00'),
(12, '002', 'price', '2000', '2021-11-03 01:00:00'),
(13, '002', 'weather', 'sunny', '2021-11-01 01:00:00'),
(14, '002', 'weather', 'rain', '2021-11-02 01:00:00'),
(15, '002', 'price', '1900', '2021-11-04 01:00:00'),
(16, '002', 'feeling', 'sad', '2021-11-01 01:00:00'),
(17, '002', 'feeling', 'angry', '2021-11-02 01:00:00'),
(18, '002', 'feeling', 'fine', '2021-11-03 01:00:00'),
(19, '003', 'price', '1000', '2021-11-01 01:00:00'),
(20, '003', 'price', '1500', '2021-11-02 01:00:00'),
(21, '003', 'price', '2000', '2021-11-03 01:00:00'),
(22, '003', 'condition', 'ok', '2021-11-01 01:00:00'),
(23, '003', 'weather', 'rain', '2021-11-02 01:00:00'),
(24, '003', 'condition', 'bad', '2021-11-03 01:00:00'),
(25, '003', 'feeling', 'fine', '2021-11-01 01:00:00'),
(26, '003', 'weather', 'sunny', '2021-11-03 01:00:00'),
(27, '003', 'feeling', 'sad', '2021-11-03 01:00:00')
;
For readability, I ordered the table above by id and timestamp; the ordering itself doesn't matter.
- We are using Postgresql Version: PostgreSQL 9.5.19
- The actual table contains over 4 million rows
- The item column contains over 500 distinct items, but I will use at most 10 items per query. The table above uses only 4 of them.
- We also have another table, Customer_table, with unique customer_id values and the customers' general information.
From the above table, I want to query the most recently updated value per item for each customer, as below. Since I use at most 10 items per query, there will be at most 10 item columns.
customer_id | price | condition | feeling | weather | ... (other columns from item)
---|---|---|---|---|
002 | 1900 | null | fine | rain
001 | 1400 | ok | fine | null
003 | 2000 | bad | sad | sunny
This is the query I got from my previous question, but I asked it for only two items.
SELECT customer_id, p.value AS price, c.value AS condition
FROM (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'condition'
ORDER BY customer_id, timestamp DESC
) c
FULL JOIN (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'price'
ORDER BY customer_id, timestamp DESC
) p USING (customer_id)
So, if there is a better solution, please help me. Thank you.
CodePudding user response:
You may try other approaches using row_number()
to identify each customer's most recent row per item: partition by customer_id and item, order by timestamp descending, and keep only the rows with rn=1. You can then aggregate by customer_id, taking MAX() of a CASE expression (or a FILTER clause) that picks out each item's value.
These approaches are less verbose and, based on the online results, appear at least as performant. Let me know in the comments how they perform when replicated in your environment.
You may use EXPLAIN ANALYZE
to compare these approaches to your current one. The results in the online environment were:
Current Approach
| Planning time: 0.129 ms
| Execution time: 0.056 ms
Suggested Approach 1
| Planning time: 0.061 ms
| Execution time: 0.070 ms
Suggested Approach 2
| Planning time: 0.047 ms
| Execution time: 0.056 ms
NB: You should run EXPLAIN ANALYZE
in your own environment, which we cannot replicate online; the results also vary from run to run. Indexes on the item
column and filtering on the desired items as early as possible are also recommended to improve performance.
Schema (PostgreSQL v9.5)
Suggested Approach 1
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as condition,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='feeling' THEN t1.value END) as feeling,
MAX(CASE WHEN t1.item='weather' THEN t1.value END) as weather
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
-- ensure that you filter based on your desired items
-- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
1;
customer_id | condition | price | feeling | weather |
---|---|---|---|---|
001 | ok | 1400 | fine | |
002 | | 1900 | fine | rain |
003 | bad | 2000 | sad | sunny |
Suggested Approach 2
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='condition') as condition,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='feeling') as feeling,
MAX(t1.value) FILTER (WHERE t1.item='weather') as weather
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
-- ensure that you filter based on your desired items
-- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
1;
customer_id | condition | price | feeling | weather |
---|---|---|---|---|
001 | ok | 1400 | fine | |
002 | | 1900 | fine | rain |
003 | bad | 2000 | sad | sunny |
Current Approach with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT customer_id, p.value AS price, c.value AS condition
FROM (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'condition'
ORDER BY customer_id, timestamp DESC
) c
FULL JOIN (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'price'
ORDER BY customer_id, timestamp DESC
) p USING (customer_id);
QUERY PLAN |
---|
Merge Full Join (cost=35.05..35.12 rows=1 width=128) (actual time=0.025..0.030 rows=3 loops=1) |
Merge Cond: (tbl.customer_id = tbl_1.customer_id) |
Buffers: shared hit=2 |
-> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.013..0.014 rows=2 loops=1) |
Buffers: shared hit=1 |
-> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.013..0.013 rows=5 loops=1) |
Sort Key: tbl.customer_id, tbl."timestamp" DESC |
Sort Method: quicksort Memory: 25kB |
Buffers: shared hit=1 |
-> Seq Scan on tbl (cost=0.00..17.50 rows=3 width=72) (actual time=0.004..0.006 rows=5 loops=1) |
Filter: (item = 'condition'::text) |
Rows Removed by Filter: 22 |
Buffers: shared hit=1 |
-> Materialize (cost=17.52..17.55 rows=1 width=64) (actual time=0.010..0.013 rows=3 loops=1) |
Buffers: shared hit=1 |
-> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.010..0.012 rows=3 loops=1) |
Buffers: shared hit=1 |
-> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.010..0.010 rows=10 loops=1) |
Sort Key: tbl_1.customer_id, tbl_1."timestamp" DESC |
Sort Method: quicksort Memory: 25kB |
Buffers: shared hit=1 |
-> Seq Scan on tbl tbl_1 (cost=0.00..17.50 rows=3 width=72) (actual time=0.001..0.003 rows=10 loops=1) |
Filter: (item = 'price'::text) |
Rows Removed by Filter: 17 |
Buffers: shared hit=1 |
Planning time: 0.129 ms |
Execution time: 0.056 ms |
Suggested Approach 1 with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as condition
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN |
---|
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.039..0.047 rows=3 loops=1) |
Group Key: t1.customer_id |
Buffers: shared hit=1 |
-> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.030..0.040 rows=5 loops=1) |
Filter: (t1.rn = 1) |
Rows Removed by Filter: 10 |
Buffers: shared hit=1 |
-> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.029..0.038 rows=15 loops=1) |
Buffers: shared hit=1 |
-> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.028..0.030 rows=15 loops=1) |
Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC |
Sort Method: quicksort Memory: 26kB |
Buffers: shared hit=1 |
-> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1) |
Filter: (item = ANY ('{price,condition}'::text[])) |
Rows Removed by Filter: 12 |
Buffers: shared hit=1 |
Planning time: 0.061 ms |
Execution time: 0.070 ms |
Suggested Approach 2 with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='condition') as condition
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN |
---|
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.029..0.037 rows=3 loops=1) |
Group Key: t1.customer_id |
Buffers: shared hit=1 |
-> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.021..0.032 rows=5 loops=1) |
Filter: (t1.rn = 1) |
Rows Removed by Filter: 10 |
Buffers: shared hit=1 |
-> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.021..0.030 rows=15 loops=1) |
Buffers: shared hit=1 |
-> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.019..0.021 rows=15 loops=1) |
Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC |
Sort Method: quicksort Memory: 26kB |
Buffers: shared hit=1 |
-> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1) |
Filter: (item = ANY ('{price,condition}'::text[])) |
Rows Removed by Filter: 12 |
Buffers: shared hit=1 |
Planning time: 0.047 ms |
Execution time: 0.056 ms |
View working demo on DB Fiddle
CodePudding user response:
You operate on a big table. You mentioned 4 million rows, obviously growing. While querying for ...
- all customers
- all items
- with few rows per (customer_id, item)
- with narrow rows (small row size)
... ggordon's solutions with row_number() are great. And short, too.
The whole table has to be processed in a sequential scan. Indices won't be used.
But prefer "Approach 2" with the modern aggregate FILTER syntax. It's clearer and faster.
Approach 3: Pivot with crosstab()
crosstab() is typically faster, especially for more than a few items.
SELECT *
FROM crosstab(
$$
SELECT customer_id, item, value
FROM (
SELECT customer_id, item, value
, row_number() OVER (PARTITION BY customer_id, item ORDER BY t.timestamp DESC) AS rn
FROM tbl t
WHERE item = ANY ('{condition,price,feeling,weather}') -- your items here ...
) t1
WHERE rn = 1
ORDER BY customer_id, item
$$
, $$SELECT unnest('{condition,price,feeling,weather}'::text[])$$ -- ... here ...
) AS ct (customer_id text, condition text, price text, feeling text, weather text); -- ... and here ...
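Note: crosstab() is provided by the additional module tablefunc, which must be installed once per database before the query above will run:

```sql
-- tablefunc ships with standard Postgres; this registers crosstab() in the current DB
CREATE EXTENSION IF NOT EXISTS tablefunc;
```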
Approach 4: LATERAL Subqueries
If one or more of the criteria listed at the top do not apply, the above queries fall off quickly in performance.
For starters, only max 10 of "500 distinct items" are involved. That's max ~ 2 % of the big table. That alone should make one of the following queries (much) faster in comparison:
SELECT *
FROM (SELECT customer_id FROM customer) c
LEFT JOIN LATERAL (
SELECT value AS condition
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'condition'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t1 ON true
LEFT JOIN LATERAL (
SELECT value AS price
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'price'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t2 ON true
LEFT JOIN LATERAL (
SELECT value AS feeling
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'feeling'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t3 ON true
-- ... more?
The point is to get a query plan with relatively few index(-only) scans to replace the expensive sequential scan on the big table.
Requires an applicable index, obviously:
CREATE INDEX ON tbl (customer_id, item);
Or better (in Postgres 9.5):
CREATE INDEX ON tbl (customer_id, item, timestamp DESC, value);
In Postgres 11 or later, this would be better yet:
CREATE INDEX ON tbl (customer_id, item, timestamp DESC) INCLUDE (value);
If only a few items are of interest, partial indexes on those items would be even better.
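For example, a sketch with a hypothetical index name, assuming 'price' is one of the hot items:

```sql
-- Covers only the 'price' rows; usable by queries filtering on item = 'price'
CREATE INDEX tbl_price_idx ON tbl (customer_id, timestamp DESC, value)
WHERE item = 'price';
```

One such small partial index per queried item keeps each index tiny compared to a full index over all 500+ items.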
Approach 5: Correlated Subqueries
SELECT c.customer_id
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price' ORDER BY t.timestamp DESC LIMIT 1) AS price
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'feeling' ORDER BY t.timestamp DESC LIMIT 1) AS feeling
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'weather' ORDER BY t.timestamp DESC LIMIT 1) AS weather
FROM customer c;
Not as versatile as LATERAL, but good enough for the purpose. Same index requirements as approach 4.
Approach 5 will be king of performance in most cases.
db<>fiddle here
Improving your relational design and/or upgrading to a current version of Postgres would go a long way, too.
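For illustration only, here is what such a redesign might look like, assuming the items you actually query are stable enough to become typed columns (table and column names are hypothetical):

```sql
-- One narrow row per customer per observation time, instead of one row per item
CREATE TABLE customer_status (
  customer_id text      NOT NULL,  -- FK to your customer table
  recorded_at timestamp NOT NULL,
  price       numeric,             -- proper types instead of text
  condition   text,
  feeling     text,
  weather     text,
  PRIMARY KEY (customer_id, recorded_at)
);
```

The "latest per customer" query then becomes a plain DISTINCT ON (customer_id) ... ORDER BY customer_id, recorded_at DESC over this table, with no pivot at all.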