Home > database >  Presto - Insert Missing Timestamps
Presto - Insert Missing Timestamps

Time:10-28

EDIT: Made a few clarifications in the post below

I am trying to work on a problem where I have a table that has missing timestamps. Let's say there is a table like this:

Timestamp NumericField
2021-10-24 16:59:00.000 101
2021-10-24 16:57:00.000 101

I would like to try and do 2 things:

  1. Populate a third record where the timestamp would be 2021-10-24 16:58:00.000.

  2. Along with that, I would like to populate the NumericField field as 101 if the lead and lag records match, like in this example. The result would be:

Timestamp NumericField
2021-10-24 16:59:00.000 101
2021-10-24 16:58:00.000 101
2021-10-24 16:57:00.000 101

If the lead and lag NumericField records do not match, then the NumericField that was generated would result in null. Example being:

Timestamp NumericField
2021-10-24 16:59:00.000 101
2021-10-24 16:58:00.000 NULL
2021-10-24 16:57:00.000 100

The reason I am posting this question is because recursive CTE's isn't supported in Presto and I couldn't find any good resources to help me with this problem.

CodePudding user response:

I would try using lag to find previous values and then sequence to generate array of dates with interval '1' minute step, unnest it and union the result with original table:

WITH dataset (Timestamp, NumericField) AS (
    VALUES (timestamp '2021-10-24 16:59:00.000', 101),
        (timestamp '2021-10-24 16:57:00.000', 101),
        (timestamp '2021-10-24 16:55:00.000', 99)
)
SELECT date as Timestamp,
    val as NumericField
FROM (
        SELECT array_except(
                sequence(prev_ts, Timestamp, interval '1' minute),
                array [ prev_ts, timestamp ] -- exclude border values
            ) as dates,
            case
                NumericField
                when prev_num then prev_num
            end as val
        FROM (
                SELECT *,
                    lag(Timestamp) over(order by Timestamp) prev_ts,
                    lag(NumericField) over(order by Timestamp) prev_num
                FROM dataset
            )
    ) seq
    CROSS JOIN UNNEST(dates) AS t (date)
UNION
SELECT *
FROM dataset
ORDER BY timestamp

Output:

Timestamp NumericField
2021-10-24 16:55:00.000 99
2021-10-24 16:56:00.000
2021-10-24 16:57:00.000 101
2021-10-24 16:58:00.000 101
2021-10-24 16:59:00.000 101
  • Related