Home > other >  How to extract time series data before and after a specific event by using AWS athena?
How to extract time series data before and after a specific event by using AWS athena?

Time:09-17

I have the storage that stores huge time series data and I can extract data through AWS athena. However, I don't have any ideas how I can extract time series data before and after a specific event by using AWS athena.

What query can achieve it ?

Does anyone have ideas and samples of query on Athena ?

For example, I have following input data.

<input data>

id | timestamp              | value | level |
---------------------------------------------
1  | 2021-04-01T12:00:00 00:00 | 100.0 | 1     |
2  | 2021-04-01T12:00:10 00:00 | 98.0  | 1     |
3  | 2021-04-01T12:00:20 00:00 | 99.5  | 1     |
...
58 | 2021-04-01T12:09:40 00:00 | 98.2  | 1     |
59 | 2021-04-01T12:09:50 00:00 | 95.3  | 1     |
60 | 2021-04-01T12:10:00 00:00 | 99.2  | 1     |
61 | 2021-04-01T12:10:10 00:00 | 97.6  | 2     |
62 | 2021-04-01T12:10:20 00:00 | 98.6  | 2     |
63 | 2021-04-01T12:10:30 00:00 | 98.3  | 2     |
64 | 2021-04-01T12:10:40 00:00 | 98.1  | 2     |
...
100 | 2021-04-01T12:16:40 00:00 | 97.6  | 2     |

What I want to do is to extract the records for 30 sec before and after level 1->2 change event.

In this case, the expected outputs are data from id:58 to id:64.

CodePudding user response:

You can use 'lag' function to determine timestamps where the level changes:

SELECT *
FROM (SELECT *
      FROM (SELECT timestamp,
                   lag(level) OVER (order by timestamp) AS prev_level,
                   level
            FROM dataset)
      WHERE prev_level != level)

And then use those timestamps to filter out the dataset. For example something like this:

WITH dataset(id,timestamp,value,level) AS (
    VALUES 

('1',timestamp '2021-04-01 12:00:00 00:00',100.0,1),
('2',timestamp '2021-04-01 12:00:10 00:00',98.0,1),
('3',timestamp '2021-04-01 12:00:20 00:00',99.5,1),
('58',timestamp '2021-04-01 12:09:40 00:00',98.2,1),
('59',timestamp '2021-04-01 12:09:50 00:00',95.3,1),
('60',timestamp '2021-04-01 12:10:00 00:00',99.2,1),
('61',timestamp '2021-04-01 12:10:10 00:00',97.6,2),
('62',timestamp '2021-04-01 12:10:20 00:00',98.6,2),
('63',timestamp '2021-04-01 12:10:30 00:00',98.3,2),
('64',timestamp '2021-04-01 12:10:40 00:00',98.1,2),
('100',timestamp '2021-04-01 12:16:40 00:00',97.6,2)
)


SELECT *
FROM dataset o
WHERE EXISTS(
              SELECT *
              FROM (SELECT *
                    FROM (SELECT timestamp,
                                 lag(level) OVER (order by timestamp) AS prev_level,
                                 level
                          FROM dataset)
                    WHERE prev_level != level)
              WHERE (o.level = level AND o.timestamp - timestamp < interval '30' second)
                 OR (o.level = prev_level AND timestamp - o.timestamp < interval '30' second)
          )

Output:

id timestamp value level
59 2021-04-01 12:09:50.000 UTC 95.3 1
60 2021-04-01 12:10:00.000 UTC 99.2 1
61 2021-04-01 12:10:10.000 UTC 97.6 2
62 2021-04-01 12:10:20.000 UTC 98.6 2
63 2021-04-01 12:10:30.000 UTC 98.3 2
  • Related