I have a table that contains students publications like this
id | student |
---|---|
1 | john |
2 | anthony |
3 | steven |
4 | lucille |
5 | anthony |
6 | steven |
7 | john |
8 | lucille |
9 | john |
10 | anthony |
11 | steven |
12 | lucille |
13 | john |
so the idea is about to have a query that fetchs all ordered occurences of a determinated student names
context :
answer to the question : how many times John is publishing just after Anthony (who is publishing just after Steven ...) and get id of each occurence
example :
If I look for all occurences of [john, anthony]
I'll get (note that the ids must be successive for each occurence)
id | student |
---|---|
1 | john |
2 | anthony |
9 | john |
10 | anthony |
Or :
id | -- comment |
---|---|
1 | (id of first occurence of john, anthony) |
9 | (id of second occurence of john, anthony) |
If I look for [anthony, steven, lucille]
i'll get
id | student |
---|---|
2 | anthony |
3 | steven |
4 | lucille |
10 | anthony |
11 | steven |
12 | lucille |
Or :
id | -- comment |
---|---|
2 | (id of first occurence of anthony, steven, lucille) |
10 | (id of second occurence of anthony, steven, lucille) |
Any ideas or leads to help me move forward?
CodePudding user response:
That should do the trick, performance wise.
The main idea is to split the data by the first student that is in our search list, but not in all places -
Since the same student can appear multiple times in our search list, we need to make sure that we're not breaking the pattern in the middle.
We're doing that by verifying that each occurrence of the first student is far enough from its previous occurrence, that is, the distance between the two occurrences is bigger than the search list length (the number of non-unique students' names within the search list)
with
prm(students) as (select 'anthony,steven,lucille,anthony')
,prm_ext(search_pattern, first_student, tokens_num) as
(
select regexp_replace(students, '^|(,)','\1\d ;', 'g') as search_pattern
,split_part(students, ',', 1) as first_student
,array_length(string_to_array(students, ','), 1) as tokens_num
from prm
)
,prev_student as
(
select id
,student
,lag(id) over (partition by student order by id) as student_prev_id
from t
)
,seq as
(
select id
,student
,sum(case when student = p.first_student and coalesce(id - student_prev_id >= p.tokens_num, true) then 1 end) over (order by id) as seq_id
,id - max(case when student = p.first_student then id end) over (order by id) as distance_from_first_student
from prev_student cross join prm_ext as p
order by id
)
select split_part(unnest(regexp_matches(string_agg(id || ';' || student, ',' order by id), (select search_pattern from prm_ext), 'g')), ';', 1)::int as id
from seq cross join prm_ext p
where seq_id is not null
and distance_from_first_student < p.tokens_num
group by seq_id
This is the result for an extended data sample:
id |
---|
2 |
16 |
22 |
CodePudding user response:
Start with this and if it explodes we'll do some performance improvements, with the price of making the code a little bit more complicated.
with
prm(students) as (select 'anthony,steven,lucille')
,prm_ext(students_regex) as (select regexp_replace(students, '^|(,)','\1\d ;', 'g') from prm)
select split_part(unnest(regexp_matches(string_agg(id || ';' || student, ',' order by id), (select students_regex from prm_ext), 'g')), ';', 1)::int as id
from t
id |
---|
2 |
10 |
with
prm(students) as (select 'anthony,steven,lucille')
,prm_ext(students_regex) as (select regexp_replace(students, '^|(,)','\1\d ;', 'g') from prm)
select cols[1]::int as id
,cols[2]::text as student
from (select string_to_array(string_to_table(unnest(regexp_matches(string_agg(id || ';' || student, ',' order by id), (select students_regex from prm_ext), 'g')), ','), ';') as cols
from t
) t
id | student |
---|---|
2 | anthony |
3 | steven |
4 | lucille |
10 | anthony |
11 | steven |
12 | lucille |