PostgreSQL: fill in the blanks in an outer join

Outer Join 'fill-in-the-blanks'

I have a pair of master-detail tables in a PostgreSQL database. The master table 'samples' holds samples, each with a timestamp. The detail table 'sample_values' holds values for some parameters at some of those sample timestamps.

My Query

SELECT s.sample_id, s.sample_time, v.parameter_id, v.sample_value
FROM samples s LEFT OUTER JOIN sample_values v ON v.sample_id=s.sample_id
ORDER BY s.sample_id, v.parameter_id;

returns (as expected):

sample_id	sample_time	parameter_id	sample_value
1	2023-01-13T01:00:00.000Z	1	1.23
1	2023-01-13T01:00:00.000Z	2	4.98
2	2023-01-13T01:01:00.000Z
3	2023-01-13T01:02:00.000Z
4	2023-01-13T01:03:00.000Z
5	2023-01-13T01:04:00.000Z	2	6.08
6	2023-01-13T01:05:00.000Z
7	2023-01-13T01:06:00.000Z	1	1.89
8	2023-01-13T01:07:00.000Z
9	2023-01-13T01:08:00.000Z
10	2023-01-13T01:09:00.000Z
11	2023-01-13T01:10:00.000Z
12	2023-01-13T01:11:00.000Z
13	2023-01-13T01:12:00.000Z
14	2023-01-13T01:13:00.000Z
15	2023-01-13T01:14:00.000Z	1	2.11
16	2023-01-13T01:15:00.000Z
17	2023-01-13T01:16:00.000Z
18	2023-01-13T01:17:00.000Z
19	2023-01-13T01:18:00.000Z	2	3.57
20	2023-01-13T01:19:00.000Z
21	2023-01-13T01:20:00.000Z
22	2023-01-13T01:21:00.000Z
23	2023-01-13T01:22:00.000Z	1	3.21
23	2023-01-13T01:22:00.000Z	2	5.31

How do I write a query that returns one row per timestamp per parameter, where sample_value is the 'latest known' sample_value for that parameter, like this:

sample_id	sample_time	parameter_id	sample_value
1	2023-01-13T01:00:00.000Z	1	1.23
1	2023-01-13T01:00:00.000Z	2	4.98
2	2023-01-13T01:01:00.000Z	1	1.23
2	2023-01-13T01:01:00.000Z	2	4.98
3	2023-01-13T01:02:00.000Z	1	1.23
3	2023-01-13T01:02:00.000Z	2	4.98
4	2023-01-13T01:03:00.000Z	1	1.23
4	2023-01-13T01:03:00.000Z	2	4.98
5	2023-01-13T01:04:00.000Z	1	1.23
5	2023-01-13T01:04:00.000Z	2	6.08
6	2023-01-13T01:05:00.000Z	1	1.23
6	2023-01-13T01:05:00.000Z	2	6.08
7	2023-01-13T01:06:00.000Z	1	1.89
7	2023-01-13T01:06:00.000Z	2	6.08
8	2023-01-13T01:07:00.000Z	1	1.89
8	2023-01-13T01:07:00.000Z	2	6.08


I cannot get my head around the LAST_VALUE function (if that is even the right tool for this?):

LAST_VALUE ( expression )  
OVER ( 
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Answer:

First of all, you need one row per parameter for each of your sample ids (two rows here, since there are two parameters). You can get this by cross joining samples with the distinct parameter_ids from sample_values, and by adding the parameter condition to the left join as well.

...
FROM samples s
CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
LEFT JOIN sample_values v 
       ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
...
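To see the effect of this step in isolation, here is a minimal sketch that runs on its own. The inline samples and sample_values CTEs are an assumption: they stand in for the real tables and hold just three samples, with values only for sample 1.

```sql
-- Hypothetical inline data standing in for the real tables.
WITH samples(sample_id) AS (
    VALUES (1), (2), (3)
),
sample_values(sample_id, parameter_id, sample_value) AS (
    VALUES (1, 1, 1.23), (1, 2, 4.98)
)
SELECT s.sample_id, p.parameter_id, v.sample_value
FROM samples s
-- One row per (sample, parameter) pair, whether or not a value exists.
CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
-- Attach the value only where both the sample and the parameter match.
LEFT JOIN sample_values v
       ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
ORDER BY s.sample_id, p.parameter_id;
```

This returns six rows, one per sample and parameter. Only the two rows for sample 1 carry a value; the other four have a null sample_value, which is what the next step fills in.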

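In addition to this, your intuition about the LAST_VALUE window function was right. The problem is that PostgreSQL, at the time of writing, cannot skip null values in window functions (there is no IGNORE NULLS option). A common workaround is to compute a running count of the non-null sample_value entries per parameter_id. The count only increases on rows that carry a value, so all rows with the same parameter_id and the same count form a group containing exactly one non-null value followed by its trailing nulls. Taking the maximum over each group then returns that single non-null value. Rows that come before a parameter's first value have a count of 0 and stay null.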

WITH cte AS (
    SELECT s.sample_id, s.sample_time, p.parameter_id, v.sample_value,
           COUNT(v.sample_value) OVER(
               PARTITION BY p.parameter_id 
               ORDER     BY s.sample_id
           ) AS partitions
    FROM samples s
    CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
    LEFT JOIN sample_values v 
           ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
)
SELECT sample_id, sample_time, parameter_id, 
       COALESCE(sample_value, 
                MAX(sample_value) OVER (PARTITION BY parameter_id, partitions)
       ) AS sample_value
FROM cte
ORDER BY sample_id, parameter_id;
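If the partitions column in the cte above is hard to picture, the following sketch prints it for a single parameter. The inline data is an assumption: it is a subset of the rows from the question (samples 1 to 8), written as CTEs so the query runs without the real tables.

```sql
-- Hypothetical inline data: a subset of the question's rows.
WITH samples(sample_id) AS (
    VALUES (1), (2), (3), (4), (5), (6), (7), (8)
),
sample_values(sample_id, parameter_id, sample_value) AS (
    VALUES (1, 1, 1.23), (1, 2, 4.98), (5, 2, 6.08), (7, 1, 1.89)
)
SELECT s.sample_id, p.parameter_id, v.sample_value,
       -- Running count of non-null values: it only increases
       -- on rows that actually carry a sample_value.
       COUNT(v.sample_value) OVER (
           PARTITION BY p.parameter_id
           ORDER BY s.sample_id
       ) AS partitions
FROM samples s
CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
LEFT JOIN sample_values v
       ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
WHERE p.parameter_id = 1
ORDER BY s.sample_id;
```

For parameter 1, samples 1 to 6 get partitions = 1 (only sample 1 holds a value, 1.23) and samples 7 and 8 get partitions = 2 (sample 7 holds 1.89). MAX(sample_value) over (parameter_id, partitions) therefore picks 1.23 for the first group and 1.89 for the second.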

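For comparison, the same 'latest known value' lookup can also be written with a LATERAL subquery instead of window functions. This is a different technique from the one above, shown only as a sketch. It assumes, as the query above does, that sample_id order matches sample_time order, and the inline data is again a hypothetical subset of the question's rows.

```sql
-- Hypothetical inline data: a subset of the question's rows.
WITH samples(sample_id) AS (
    VALUES (1), (2), (3), (4), (5), (6), (7), (8)
),
sample_values(sample_id, parameter_id, sample_value) AS (
    VALUES (1, 1, 1.23), (1, 2, 4.98), (5, 2, 6.08), (7, 1, 1.89)
)
SELECT s.sample_id, p.parameter_id, lv.sample_value
FROM samples s
CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
-- For each (sample, parameter) pair, fetch the most recent value
-- recorded at or before this sample; null if none exists yet.
LEFT JOIN LATERAL (
    SELECT v.sample_value
    FROM sample_values v
    WHERE v.parameter_id = p.parameter_id
      AND v.sample_id <= s.sample_id
    ORDER BY v.sample_id DESC
    LIMIT 1
) lv ON true
ORDER BY s.sample_id, p.parameter_id;
```

This version runs one lookup per output row, so on larger tables it probably wants an index on sample_values (parameter_id, sample_id). Which of the two performs better depends on the data, so it is worth checking both with EXPLAIN ANALYZE.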