Data:
I have a table with two columns, id and timestamp. Sample data is shown below:
id timestamp
1 2021-06-30 02:19:27.582
2 2021-06-25 03:20:23.324
3 2021-05-03 12:33:33.743
4 2021-07-20 23:04:42.105
...
Question:
I want to find the timestamp at the 5th percentile, the 10th percentile, the 15th percentile, and so on up to the 100th percentile. I thought this would be straightforward: apply a percentile function to the timestamp column. However, in Snowflake the percentile functions PERCENTILE_CONT and PERCENTILE_DISC do not accept datetime or date types.
Requirement:
It is a big data set, so I have to solve this in SQL. Because the data is not evenly distributed over time, I don't want to use (endtime - starttime) / 100 * n + starttime to find the nth percentile timestamp.
I have searched online for a while and still haven't found anything helpful. Could anyone please share ideas? Thank you!
Solution:
See the original answer from @Lukasz Szozda below. The main idea is to convert the datetime or date values to numbers first so that the percentile function accepts them, then convert the result back to a datetime or date at the end.
TO_TIMESTAMP(PERCENTILE_CONT(0.05)
WITHIN GROUP(ORDER BY DATE_PART(EPOCH_SECOND, timestamp)) OVER())
AS date_at_5th_percentile
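To get several percentiles at once, the same expression can be repeated with different constants. The sketch below is only an illustration: it assumes the sample table tab(id, tm) from the answer further down, uses the aggregate form of PERCENTILE_CONT (no OVER clause) so that a single row comes back, and the column aliases p05, p10 and p100 are made up for this example.

```sql
-- Sketch only: assumes the sample table tab(id INT, tm TIMESTAMP) from the answer.
-- The percentile argument must be a constant, so each percentile gets its own expression.
SELECT
    TO_TIMESTAMP(PERCENTILE_CONT(0.05)
        WITHIN GROUP (ORDER BY DATE_PART(EPOCH_SECOND, tm))) AS p05,
    TO_TIMESTAMP(PERCENTILE_CONT(0.10)
        WITHIN GROUP (ORDER BY DATE_PART(EPOCH_SECOND, tm))) AS p10,
    -- add the remaining steps (0.15, 0.20, and so on) in the same way
    TO_TIMESTAMP(PERCENTILE_CONT(1.00)
        WITHIN GROUP (ORDER BY DATE_PART(EPOCH_SECOND, tm))) AS p100
FROM tab;
```

Because PERCENTILE_CONT interpolates between neighbouring values, the returned timestamps may not match any row in the table. If an existing timestamp is required, PERCENTILE_DISC can be substituted in the same position.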
Answer:
Using PERCENT_RANK:
CREATE OR REPLACE TABLE tab(id INT, tm TIMESTAMP)
AS
SELECT 1, '2021-06-30 02:19:27.582' UNION ALL
SELECT 2, '2021-06-25 03:20:23.324' UNION ALL
SELECT 3, '2021-05-03 12:33:33.743' UNION ALL
SELECT 4, '2021-07-20 23:04:42.105';
Query:
SELECT *, PERCENT_RANK() OVER(/*PARTITION BY ... */ ORDER BY tm) AS p
FROM tab
ORDER BY id;
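A related ranking-based option, which is not part of the original answer, is to split the ordered rows into 20 equal-sized buckets with NTILE and take the latest timestamp in each bucket. This is a rough sketch under the assumption that the table has far more than 20 rows. With only a handful of rows, as in the sample table, most buckets are missing and the labels are not meaningful. The column names bucket, percentile and tm_at_percentile are hypothetical.

```sql
-- Sketch only: approximates the 5th, 10th, ..., 100th percentile timestamps
-- by bucketing the rows ordered by tm into 20 groups of (nearly) equal size.
SELECT
    bucket * 5 AS percentile,        -- bucket 1 -> 5th percentile, bucket 20 -> 100th
    MAX(tm)    AS tm_at_percentile   -- latest timestamp that falls in the bucket
FROM (
    SELECT tm, NTILE(20) OVER (ORDER BY tm) AS bucket
    FROM tab
) AS bucketed
GROUP BY bucket
ORDER BY bucket;
```

This returns values that actually occur in the data and needs no conversion to a number, at the cost of returning bucket boundaries rather than interpolated percentiles.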
As the question notes, in Snowflake the percentile functions PERCENTILE_CONT and PERCENTILE_DISC do not accept datetime or date types directly:
SELECT *, PERCENTILE_CONT(0.05) WITHIN GROUP(ORDER BY tm) OVER() AS p
FROM tab
ORDER BY id;
-- incompatible types: [TIMESTAMP_NTZ(9)] and [NUMBER(9,0)]
Convert the timestamp to EPOCH_SECOND (integer seconds), then convert the result back to a timestamp (LTZ/NTZ/...) afterwards:
SELECT *, TO_TIMESTAMP(PERCENTILE_CONT(0.05)
WITHIN GROUP(ORDER BY DATE_PART(EPOCH_SECOND, tm)) OVER()) AS p
FROM tab
ORDER BY id;