Outer Join 'fill-in-the blanks'
I have a pair of master-detail tables in a PostgreSQL database where master table 'samples' has some samples with a timestamp in each. The detail table 'sample_values' has some values for some parameters at any given sample timestamp.
My Query
SELECT s.sample_id, s.sample_time, v.parameter_id, v.sample_value
FROM samples s LEFT OUTER JOIN sample_values v ON v.sample_id=s.sample_id
ORDER BY s.sample_id, v.parameter_id;
returns (as expected):
sample_id | sample_time | parameter_id | sample_value |
---|---|---|---|
1 | 2023-01-13T01:00:00.000Z | 1 | 1.23 |
1 | 2023-01-13T01:00:00.000Z | 2 | 4.98 |
2 | 2023-01-13T01:01:00.000Z | ||
3 | 2023-01-13T01:02:00.000Z | ||
4 | 2023-01-13T01:03:00.000Z | ||
5 | 2023-01-13T01:04:00.000Z | 2 | 6.08 |
6 | 2023-01-13T01:05:00.000Z | ||
7 | 2023-01-13T01:06:00.000Z | 1 | 1.89 |
8 | 2023-01-13T01:07:00.000Z | ||
9 | 2023-01-13T01:08:00.000Z | ||
10 | 2023-01-13T01:09:00.000Z | ||
11 | 2023-01-13T01:10:00.000Z | ||
12 | 2023-01-13T01:11:00.000Z | ||
13 | 2023-01-13T01:12:00.000Z | ||
14 | 2023-01-13T01:13:00.000Z | ||
15 | 2023-01-13T01:14:00.000Z | 1 | 2.11 |
16 | 2023-01-13T01:15:00.000Z | ||
17 | 2023-01-13T01:16:00.000Z | ||
18 | 2023-01-13T01:17:00.000Z | ||
19 | 2023-01-13T01:18:00.000Z | 2 | 3.57 |
20 | 2023-01-13T01:19:00.000Z | ||
21 | 2023-01-13T01:20:00.000Z | ||
22 | 2023-01-13T01:21:00.000Z | ||
23 | 2023-01-13T01:22:00.000Z | 1 | 3.21 |
23 | 2023-01-13T01:22:00.000Z | 2 | 5.31 |
How do I write a query that returns one row per timestamp per parameter, where sample_value is the 'latest known' sample_value for that parameter like this:
sample_id | sample_time | parameter_id | sample_value |
---|---|---|---|
1 | 2023-01-13T01:00:00.000Z | 1 | 1.23 |
1 | 2023-01-13T01:00:00.000Z | 2 | 4.98 |
2 | 2023-01-13T01:01:00.000Z | 1 | 1.23 |
2 | 2023-01-13T01:01:00.000Z | 2 | 4.98 |
3 | 2023-01-13T01:02:00.000Z | 1 | 1.23 |
3 | 2023-01-13T01:02:00.000Z | 2 | 4.98 |
4 | 2023-01-13T01:03:00.000Z | 1 | 1.23 |
4 | 2023-01-13T01:03:00.000Z | 2 | 4.98 |
5 | 2023-01-13T01:04:00.000Z | 1 | 1.23 |
5 | 2023-01-13T01:04:00.000Z | 2 | 6.08 |
6 | 2023-01-13T01:05:00.000Z | 1 | 1.23 |
6 | 2023-01-13T01:05:00.000Z | 2 | 6.08 |
7 | 2023-01-13T01:06:00.000Z | 1 | 1.89 |
7 | 2023-01-13T01:06:00.000Z | 2 | 6.08 |
8 | 2023-01-13T01:07:00.000Z | 1 | 1.89 |
8 | 2023-01-13T01:07:00.000Z | 2 | 6.08 |
I cannot get my head around the LAST_VALUE function (if that is even the right tool for this?):
LAST_VALUE ( expression )
OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
CodePudding user response:
First of all you need two rows for each of your sample ids. You can achieve it by cross joining your sample values with the distinct amount of parameters, and ensuring the condition on parameters is met as well on the left join.
...
FROM samples s
CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
LEFT JOIN sample_values v
ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
...
In addition to this, your intuition of using the LAST_VALUE
window function was correct. Problem is that PostgreSQL is unable to ignore null values till its current version. The only workaround for this problem is to generate partitioning on your parameter_ids and sample_value (each partition will contain one non-null value and the other null values), then taking the maximum value from each partition.
WITH cte AS (
SELECT s.sample_id, s.sample_time, p.parameter_id, v.sample_value,
COUNT(v.sample_value) OVER(
PARTITION BY p.parameter_id
ORDER BY s.sample_id
) AS partitions
FROM samples s
CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
LEFT JOIN sample_values v
ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
)
SELECT sample_id, sample_time, parameter_id,
COALESCE(sample_value,
MAX(sample_value) OVER (PARTITION BY parameter_id, partitions)
) AS sample_value
FROM cte
ORDER BY sample_id, parameter_id
Check the demo here.