I have a large database with around 75K points. It represents per-country time series with the following format:
| country | value | date |
I am using a FastAPI API to serve this data to my front end. Since the chart library I use will not display more than 300 points, I'd like to downsample each time series to 300 equally spaced points, including the first and last point for each country.
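For concreteness, the table looks roughly like this (a minimal sketch; column types are my assumption):

CREATE TABLE my_table (
    country VARCHAR(2) NOT NULL,   -- country code, e.g. 'ar', 'fr'
    date    DATE       NOT NULL,   -- one observation per country per day
    value   FLOAT      NOT NULL    -- the measurement to plot
);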
As for now, I have come up with this SQL solution:
WITH E AS (
    SELECT *,
           PERCENT_RANK() OVER w AS rw
    FROM my_table
    WINDOW w AS (PARTITION BY country ORDER BY date)
)
SELECT *, FLOOR(300 * rw) AS sel
FROM E
GROUP BY E.country, sel;
I was wondering if there is a better / more optimized way to do this.
I'm already using indexes, and I'm unsure whether partitioning the SQL table would have any impact.
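For context, the kind of composite index that matches the PARTITION BY country ORDER BY date window would be along these lines (a sketch; the index name is made up):

-- lets each country's rows be read already sorted by date
CREATE INDEX idx_country_date ON my_table (country, date);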
Example
If I downsample to 5 points per country, I would like to go from this:
--------- -------- ------------
| country | value | date |
--------- -------- ------------
| ar | 1.4 | 2010-02-03 |
| ar | 1.4 | 2010-02-04 |
| ar | 1.3 | 2010-02-05 |
| ar | 1.4 | 2010-02-06 |
| ar | 1.2 | 2010-02-07 |
| ar | 1.4 | 2010-02-08 |
| ar | 1.5 | 2010-02-09 |
| ar | 1.7 | 2010-02-10 |
| ar | 1.4 | 2010-02-11 |
| ar | 1.6 | 2010-02-12 |
| ar | 1.4 | 2010-02-13 |
| ar | 1.5 | 2010-02-14 |
| ar | 1.3 | 2010-02-15 |
| ar | 1.2 | 2010-02-16 |
| fr | 1.3 | 2010-02-03 |
| fr | 1.3 | 2010-02-04 |
| fr | 1.4 | 2010-02-05 |
| fr | 1.6 | 2010-02-06 |
| fr | 1.9 | 2010-02-07 |
| fr | 1.3 | 2010-02-08 |
| fr | 1.3 | 2010-02-09 |
| fr | 1.2 | 2010-02-10 |
| fr | 1.3 | 2010-02-11 |
| fr | 1.5 | 2010-02-12 |
| fr | 1.3 | 2010-02-13 |
| fr | 1.3 | 2010-02-14 |
| fr | 1.5 | 2010-02-15 |
| fr | 1.3 | 2010-02-16 |
--------- -------- ------------
To this:
--------- -------- ------------
| country | value | date |
--------- -------- ------------
| ar | 1.4 | 2010-02-03 |
| ar | 1.4 | 2010-02-06 |
| ar | 1.5 | 2010-02-09 |
| ar | 1.4 | 2010-02-13 |
| ar | 1.2 | 2010-02-16 |
| fr | 1.3 | 2010-02-03 |
| fr | 1.6 | 2010-02-06 |
| fr | 1.3 | 2010-02-09 |
| fr | 1.3 | 2010-02-13 |
| fr | 1.3 | 2010-02-16 |
--------- -------- ------------
Edit 2
Based on @Rick James's answer, I've come up with this:
SELECT country,
       FLOOR(300 * (TO_DAYS(date) - x.da) / (x.dd - x.da)) AS g,
       date, value
FROM my_table
JOIN ( SELECT country AS cc,
              TO_DAYS(MIN(date)) AS da,
              TO_DAYS(MAX(date)) AS dd
       FROM my_table
       GROUP BY country
     ) AS x ON my_table.country = x.cc
GROUP BY country, g;
I can tell this is faster, but as I am no expert, I can't tell whether it is the best I can get or whether it is an absurd way of doing this.
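Note that because date and value are neither aggregated nor part of the GROUP BY, this form relies on ONLY_FULL_GROUP_BY being disabled and returns an arbitrary row from each bucket. A strict-mode-friendly variant would aggregate instead, along these lines (just a sketch, same assumed table name):

SELECT country,
       FLOOR(300 * (TO_DAYS(date) - x.da) / (x.dd - x.da)) AS g,
       MIN(date)  AS bucket_date,    -- representative date for the bucket
       AVG(value) AS bucket_value    -- smoothed value for the bucket
FROM my_table
JOIN ( SELECT country AS cc,
              TO_DAYS(MIN(date)) AS da,
              TO_DAYS(MAX(date)) AS dd
       FROM my_table
       GROUP BY country
     ) AS x ON my_table.country = x.cc
GROUP BY country, g;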
CodePudding user response:
I would develop a formula that maps the date to a number such that MIN(date) maps to 0.5 and MAX(date) maps to 299.5.
SELECT country, ROUND(...that formula...), AVG(value)
FROM t
GROUP BY 1,2;
Note that AVG will effectively smooth out the curve rather than picking a particular entry (which might be especially high or low).
Putting the pieces together:
SELECT country,
       -- maps each country's MIN(date) to 0.5 and MAX(date) to 299.5
       ROUND(0.5 + 299 * (TO_DAYS(date) - x.da) / (x.dd - x.da)),
       AVG(value)
FROM t
JOIN ( SELECT country, TO_DAYS(MIN(date)) AS da, TO_DAYS(MAX(date)) AS dd
       FROM t
       GROUP BY country
     ) AS x USING (country)
GROUP BY 1, 2;
CodePudding user response:
WITH
sorted AS
(
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY country
            ORDER BY date   -- choose a different column name!
        ) - 1
            AS row_id,
        COUNT(*) OVER (
            PARTITION BY country
        ) - 1
            AS step
    FROM
        my_table
)
SELECT
    sorted.*
FROM
    sorted
WHERE
    mod( row_id             * 4, step)
    <
    mod((row_id + step - 1) * 4, step)
ORDER BY
    country, row_id
The hard-coded * 4 should be * (sample_size - 1), i.e. one less than the desired number of output points (299 for a 300-point sample).
Demo: https://dbfiddle.uk/llmch7nm
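To avoid editing that constant by hand, the multiplier can be supplied as a session variable; a sketch of the same query with an assumed @sample_size variable:

SET @sample_size = 300;   -- desired number of points per country

WITH sorted AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY country ORDER BY date) - 1 AS row_id,
           COUNT(*)     OVER (PARTITION BY country)              - 1 AS step
    FROM my_table
)
SELECT sorted.*
FROM sorted
WHERE mod( row_id             * (@sample_size - 1), step)
    < mod((row_id + step - 1) * (@sample_size - 1), step)
ORDER BY country, row_id;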
EDIT:
The above behaves badly when the starting data set size is smaller than the target sample size.
This demo has two different fudges to work around that: https://dbfiddle.uk/J0Nb_CNb
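One simple guard (a sketch; the fiddle's exact fudges may differ) is to keep every row whenever a country has at most the target number of rows. This also covers single-row countries, where step is 0 and MOD(..., 0) yields NULL, so the row would otherwise be dropped:

WITH sorted AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY country ORDER BY date) - 1 AS row_id,
           COUNT(*)     OVER (PARTITION BY country)              - 1 AS step
    FROM my_table
)
SELECT sorted.*
FROM sorted
WHERE step <= 4   -- at most 5 rows for this country: keep them all
   OR mod( row_id             * 4, step)
    < mod((row_id + step - 1) * 4, step)
ORDER BY country, row_id;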