Downsampling a time series to equally spaced points for line plots


I have a large table with around 75K rows. It holds one time series per country, in the following format:

| country | value | date |

I am using a FastAPI app to serve this data to my front end. Since the chart library I use cannot display more than 300 points, I'd like to downsample each country's series to 300 equally spaced points, always including the first and the last point.

So far, I have come up with this SQL solution:

WITH
    E AS (
        SELECT *,
               PERCENT_RANK() OVER w AS rw   -- 0 for the first row, 1 for the last row of each country
        FROM my_table
        WINDOW w AS (PARTITION BY country ORDER BY date)
    )
SELECT *, FLOOR(300 * rw) AS sel
FROM E
GROUP BY E.country, sel;   -- relies on MySQL's loose GROUP BY to keep one row per bucket
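As an aside, the SELECT * combined with GROUP BY country, sel relies on MySQL's loose GROUP BY, so which row survives in each bucket is undefined. A deterministic variant of the same idea might look like this (a minimal sketch, assuming MySQL 8+; the B and rn names are mine):

WITH
    E AS (
        SELECT *,
               PERCENT_RANK() OVER w AS rw
        FROM my_table
        WINDOW w AS (PARTITION BY country ORDER BY date)
    ),
    B AS (
        SELECT *,
               -- number the rows inside each of the 301 buckets, earliest first
               ROW_NUMBER() OVER (PARTITION BY country, FLOOR(300 * rw)
                                  ORDER BY date) AS rn
        FROM E
    )
SELECT country, value, date
FROM B
WHERE rn = 1;   -- keep the first row of every bucket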

Is there a better / more optimized way to do this?

I'm already using indexes, and I'm unsure whether partitioning the SQL table would have any impact.
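If it helps, the one index this kind of query can fully use is a composite one (a minimal sketch, assuming MySQL and the three columns shown; the index name is mine):

-- (country, date) matches the PARTITION BY / ORDER BY of the window,
-- and including `value` makes the index covering, so the base rows
-- never need to be read at all
ALTER TABLE my_table
    ADD INDEX idx_country_date_value (country, date, value);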

Example

If I downsample to 5 points per country, I would like to go from this:

+---------+--------+------------+
| country | value  | date       |
+---------+--------+------------+
| ar      |   1.4  | 2010-02-03 |
| ar      |   1.4  | 2010-02-04 |
| ar      |   1.3  | 2010-02-05 |
| ar      |   1.4  | 2010-02-06 |
| ar      |   1.2  | 2010-02-07 |
| ar      |   1.4  | 2010-02-08 |
| ar      |   1.5  | 2010-02-09 |
| ar      |   1.7  | 2010-02-10 |
| ar      |   1.4  | 2010-02-11 |
| ar      |   1.6  | 2010-02-12 |
| ar      |   1.4  | 2010-02-13 |
| ar      |   1.5  | 2010-02-14 |
| ar      |   1.3  | 2010-02-15 |
| ar      |   1.2  | 2010-02-16 |
| fr      |   1.3  | 2010-02-03 |
| fr      |   1.3  | 2010-02-04 |
| fr      |   1.4  | 2010-02-05 |
| fr      |   1.6  | 2010-02-06 |
| fr      |   1.9  | 2010-02-07 |
| fr      |   1.3  | 2010-02-08 |
| fr      |   1.3  | 2010-02-09 |
| fr      |   1.2  | 2010-02-10 |
| fr      |   1.3  | 2010-02-11 |
| fr      |   1.5  | 2010-02-12 |
| fr      |   1.3  | 2010-02-13 |
| fr      |   1.3  | 2010-02-14 |
| fr      |   1.5  | 2010-02-15 |
| fr      |   1.3  | 2010-02-16 |
+---------+--------+------------+

To this:

+---------+--------+------------+
| country | value  | date       |
+---------+--------+------------+
| ar      |   1.4  | 2010-02-03 |
| ar      |   1.4  | 2010-02-06 |
| ar      |   1.5  | 2010-02-09 |
| ar      |   1.4  | 2010-02-13 |
| ar      |   1.2  | 2010-02-16 |
| fr      |   1.3  | 2010-02-03 |
| fr      |   1.6  | 2010-02-06 |
| fr      |   1.3  | 2010-02-09 |
| fr      |   1.3  | 2010-02-13 |
| fr      |   1.3  | 2010-02-16 |
+---------+--------+------------+

Edit 2

Based on @Rick James's answer, I've come up with this:

SELECT country, FLOOR(300 * (TO_DAYS(date) - x.da) / (x.dd - x.da)) AS g, date, value
    FROM my_table
    JOIN ( SELECT TO_DAYS(MIN(date)) AS da, TO_DAYS(MAX(date)) AS dd, country AS cc
              FROM my_table GROUP BY country
         ) AS x
         ON my_table.country = x.cc
    GROUP BY country, g;

I can tell this is faster, but as I am no expert, I can't tell whether it is the best I can get, or whether this is an absurd way of doing it.
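One way to judge that objectively (a sketch, assuming MySQL 8.0.18+, where EXPLAIN ANALYZE is available) is to prefix each candidate query with EXPLAIN ANALYZE and compare the measured timings and row counts per step:

EXPLAIN ANALYZE
SELECT country, FLOOR(300 * (TO_DAYS(date) - x.da) / (x.dd - x.da)) AS g, date, value
    FROM my_table
    JOIN ( SELECT TO_DAYS(MIN(date)) AS da, TO_DAYS(MAX(date)) AS dd, country AS cc
              FROM my_table GROUP BY country
         ) AS x
         ON my_table.country = x.cc
    GROUP BY country, g;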

CodePudding user response:

I would develop a formula that maps the date to a number such that MIN(date) maps to 0.5 and MAX(date) maps to 299.5.

SELECT country, ROUND(...that formula...), AVG(value)
    FROM t
    GROUP BY 1,2;

Note that AVG will effectively smooth out the curve rather than picking a particular entry (which might be especially high or low).

Putting the pieces together:

SELECT country,
       ROUND(0.5 + 299 * (TO_DAYS(date) - x.da) / (x.dd - x.da)) AS g,
       AVG(value)
    FROM t
    JOIN ( SELECT country AS cc,
                  TO_DAYS(MIN(date)) AS da,
                  TO_DAYS(MAX(date)) AS dd
              FROM t
              GROUP BY country
         ) AS x  ON t.country = x.cc
    GROUP BY 1,2;

CodePudding user response:

WITH
  sorted AS
(
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY country
          ORDER BY date     -- choose a different column name!
    ) - 1
      AS row_id,
    COUNT(*) OVER (
      PARTITION BY country
    ) - 1
      AS step
  FROM
    my_table
)
SELECT
  sorted.*
FROM
  sorted
WHERE
  mod( row_id             * 4, step)
  <
  mod((row_id + step - 1) * 4, step)
ORDER BY
  country, row_id

The hard-coded * 4 should be * (sample_size - 1); here 4 = 5 - 1 because the demo samples five points per country. A parameterized sketch follows the edit below.

Demo: https://dbfiddle.uk/llmch7nm


EDIT:

The above behaves badly when the starting data set size is smaller than the target sample size.

This demo has two different fudges to work around that: https://dbfiddle.uk/J0Nb_CNb
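For reference, the hard-coded * 4 can be pulled into a parameter CTE so the target sample size lives in one place (a minimal sketch; the params name is mine, and it inherits the same small-data-set caveat as the demos above):

WITH
  params AS (
    SELECT 5 AS sample_size   -- 5 for the demo; 300 for the real chart
  ),
  sorted AS (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY country ORDER BY date) - 1 AS row_id,
      COUNT(*)     OVER (PARTITION BY country)               - 1 AS step
    FROM my_table
  )
SELECT
  sorted.*
FROM
  sorted
CROSS JOIN params
WHERE
  mod( row_id             * (sample_size - 1), step)
  <
  mod((row_id + step - 1) * (sample_size - 1), step)
ORDER BY
  country, row_id;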
