Using MySQL window function to section timeseries when value changes-CodePudding

I am looking for a way to split a dataset into sections whenever a value change inside a partition. Note that each switch of value is a new section - also if the value switches back to the same as in an earlier section.

Trying to illustrate using some dummy data:

Schema (MySQL v8.0)

CREATE TABLE test (
  id INT AUTO_INCREMENT PRIMARY KEY,
  group_id INT,
  date DATE,
  value VARCHAR(255)
);
INSERT INTO test (group_id, date, value) VALUES
    (2, '2022-03-12', 'D'),
    (2, '2022-03-15', 'C'),
    (1, '2022-03-27', 'B'),
    (1, '2022-02-01', 'A'),
    (2, '2022-03-10', 'D'),
    (1, '2022-05-01', 'C'),
    (2, '2022-01-19', 'C'),
    (1, '2022-01-18', 'A'),
    (2, '2022-04-12', 'D'),
    (1, '2022-04-08', 'A');

Query #1

SELECT
    id,
    group_id,
    date,
    value,
    '' as section # This is the one that I don't know how to calculate
FROM test ORDER BY group_id, date;

Results I want (the section is not calculated correctly):

id	group_id	date	value	section
8	1	2022-01-18	A	1
4	1	2022-02-01	A	1
3	1	2022-03-27	B	2
10	1	2022-04-08	A	3
6	1	2022-05-01	C	4
7	2	2022-01-19	C	5
5	2	2022-03-10	D	6
1	2	2022-03-12	D	6
2	2	2022-03-15	C	7
9	2	2022-04-12	D	8

View on DB Fiddle

Notice that my main problem is that I cannot partition by (group_id, value) since it's a new section whenever the value changes, regardless if it changes back to an earlier value (see value A for example).

CodePudding user response：

You get a number that distinguishes which time a value has occurred for a group_id by subtracting the number of times that value has occurred for the group_id so far from the total occurrences of that group_id so far; a little thought will show you this value will always be the same within a series of the same value and always different from that same value appearing at a different time.

From that number, you can calculate your sequential section number. There may be a way to do that directly (with one fewer subquery), but I had to use an intermediate step of getting the date that a particular run of values for a group_id started.

SELECT id, group_id, date, value,
    dense_rank() over (partition by group_id order by group_value_incidence_start) section
FROM (    
    SELECT id, group_id, date, value,
        min(date) over (partition by group_id, value, group_value_incidence) group_value_incidence_start
    FROM (
        SELECT id, group_id, date, value,
            count(1) over (partition by group_id order by date) -
                count(1) over (partition by group_id, value order by date) group_value_incidence
        FROM test
    ) group_value_indidences
) group_value_incidence_starts
ORDER BY group_id, section

fiddle