get occurence of multiple rows in postgresql-CodePudding

I have a table that contains students publications like this

id	student
1	john
2	anthony
3	steven
4	lucille
5	anthony
6	steven
7	john
8	lucille
9	john
10	anthony
11	steven
12	lucille
13	john

so the idea is about to have a query that fetchs all ordered occurences of a determinated student names

context :

answer to the question : how many times John is publishing just after Anthony (who is publishing just after Steven ...) and get id of each occurence

example :

If I look for all occurences of [john, anthony] I'll get (note that the ids must be successive for each occurence)

id	student
1	john
2	anthony
9	john
10	anthony

Or :

id	-- comment
1	(id of first occurence of john, anthony)
9	(id of second occurence of john, anthony)

If I look for [anthony, steven, lucille] i'll get

id	student
2	anthony
3	steven
4	lucille
10	anthony
11	steven
12	lucille

Or :

id	-- comment
2	(id of first occurence of anthony, steven, lucille)
10	(id of second occurence of anthony, steven, lucille)

Any ideas or leads to help me move forward?

CodePudding user response：

That should do the trick, performance wise.

The main idea is to split the data by the first student that is in our search list, but not in all places -
Since the same student can appear multiple times in our search list, we need to make sure that we're not breaking the pattern in the middle.
We're doing that by verifying that each occurrence of the first student is far enough from its previous occurrence, that is, the distance between the two occurrences is bigger than the search list length (the number of non-unique students' names within the search list)

with
 prm(students) as (select 'anthony,steven,lucille,anthony')
,prm_ext(search_pattern, first_student, tokens_num) as 
 (
    select  regexp_replace(students, '^|(,)','\1\d ;', 'g') as search_pattern
           ,split_part(students, ',', 1)                    as first_student
           ,array_length(string_to_array(students, ','), 1) as tokens_num
           
    from    prm
)
,prev_student as
(
    select  id
           ,student
           ,lag(id) over (partition by student order by id) as student_prev_id
    
    from    t
)
,seq as
(
    select  id
           ,student
           ,sum(case when student = p.first_student and coalesce(id - student_prev_id >= p.tokens_num, true) then 1 end) over (order by id) as seq_id
           ,id - max(case when student = p.first_student then id end) over (order by id) as distance_from_first_student
    
    from    prev_student cross join prm_ext as p
    
    order by id
)
select    split_part(unnest(regexp_matches(string_agg(id || ';' || student, ',' order by id), (select search_pattern from prm_ext), 'g')), ';', 1)::int  as id
     
from      seq cross join prm_ext p

where     seq_id is not null
     and  distance_from_first_student < p.tokens_num
     
group by  seq_id

This is the result for an extended data sample:

id
2
16
22

Fiddle

CodePudding user response：

Start with this and if it explodes we'll do some performance improvements, with the price of making the code a little bit more complicated.

with
 prm(students) as (select 'anthony,steven,lucille')
,prm_ext(students_regex) as (select regexp_replace(students, '^|(,)','\1\d ;', 'g') from prm)
select  split_part(unnest(regexp_matches(string_agg(id || ';' || student, ',' order by id), (select students_regex from prm_ext), 'g')), ';', 1)::int  as id
from    t

id
2
10

with
 prm(students) as (select 'anthony,steven,lucille')
,prm_ext(students_regex) as (select regexp_replace(students, '^|(,)','\1\d ;', 'g') from prm)
select  cols[1]::int   as id
       ,cols[2]::text  as student
       
from   (select  string_to_array(string_to_table(unnest(regexp_matches(string_agg(id || ';' || student, ',' order by id), (select students_regex from prm_ext), 'g')), ','), ';') as cols
        from    t
       ) t

id	student
2	anthony
3	steven
4	lucille
10	anthony
11	steven
12	lucille

Fiddle