There is an assignments table, and every assignment has a target GUID and a serial number. There are two processes: one inserts assignments, and the other queries and executes them. Assignments for a given target must be processed strictly in serial number order, without any gaps. If there is a gap in the serial number sequence, the executing process should wait for the gap to be filled before processing further assignments for that target.
We identified two approaches to solve this. The first one is to have a boolean flag "is_ready", and set it on each insert in the following way:
UPDATE assignments
SET is_ready = true
FROM (
    SELECT target, serial_number,
           ROW_NUMBER() OVER (ORDER BY serial_number) AS row_number
    FROM assignments
    WHERE target = <#address> AND serial_number >= <#last_serial_number>
) AS subquery
WHERE
    subquery.serial_number = <#last_serial_number> + subquery.row_number AND
    assignments.serial_number = subquery.serial_number AND
    assignments.target = subquery.target;
Note that if there are any gaps, we allow at most 5 assignments to be waiting (so the subquery will return at most 5 rows).
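To make the first approach concrete, here is a minimal sketch using Python's stdlib sqlite3 as a stand-in for PostgreSQL (table and column names follow the question; the function name and the choice to treat last_serial_number as the last already-processed serial, 0 here, are my own assumptions). A row is marked ready only if its serial number equals last_serial_number plus its row number in the ordered run, i.e. it belongs to the contiguous prefix:

```python
import sqlite3

# In-memory stand-in for the assignments table described in the question.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE assignments (
        target        TEXT NOT NULL,
        serial_number INTEGER NOT NULL,
        is_ready      INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (target, serial_number)
    )
""")

def mark_ready(conn, target, last_serial_number):
    # Set is_ready on every assignment in the contiguous run that starts
    # right after last_serial_number; stop at the first gap.  A row is in
    # the run iff serial_number == last_serial_number + its row number.
    conn.execute("""
        UPDATE assignments SET is_ready = 1
        WHERE target = :t AND serial_number IN (
            SELECT serial_number FROM (
                SELECT serial_number,
                       ROW_NUMBER() OVER (ORDER BY serial_number) AS rn
                FROM assignments
                WHERE target = :t AND serial_number > :last
            )
            WHERE serial_number = :last + rn
        )
    """, {"t": target, "last": last_serial_number})

# Serial 4 is missing, so 5 must not be marked ready.
for sn in (1, 2, 3, 5):
    conn.execute("INSERT INTO assignments (target, serial_number) VALUES ('x', ?)", (sn,))
mark_ready(conn, "x", 0)
ready = [r[0] for r in conn.execute(
    "SELECT serial_number FROM assignments WHERE is_ready = 1 ORDER BY 1")]
print(ready)  # [1, 2, 3]
```

The executing process can then simply select rows with is_ready = 1 in serial order.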
Alternatively, we can have an integer field "number_of_serial_gaps". It would represent the number of gaps in serial numbers for this target before the current serial number. This way, on insert we just have to update all the records with a serial_number higher than the one being inserted:
UPDATE assignments
SET number_of_serial_gaps = number_of_serial_gaps - 1
WHERE target = <#address> AND serial_number > <#last_serial_number>
And then on SELECT we just have to check that number_of_serial_gaps = 0.
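A minimal sketch of this gap-counter approach, again with stdlib sqlite3 standing in for PostgreSQL (the insert_assignment helper and the assumption that serial numbers start at 1 are my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE assignments (
        target                TEXT NOT NULL,
        serial_number         INTEGER NOT NULL,
        number_of_serial_gaps INTEGER NOT NULL,
        PRIMARY KEY (target, serial_number)
    )
""")

def insert_assignment(conn, target, serial_number):
    # Gaps below the new serial = serials that should exist but do not yet.
    # (Assumes serial numbers start at 1; adjust for your numbering.)
    (present,) = conn.execute(
        "SELECT COUNT(*) FROM assignments WHERE target = ? AND serial_number < ?",
        (target, serial_number)).fetchone()
    gaps = (serial_number - 1) - present
    conn.execute(
        "INSERT INTO assignments VALUES (?, ?, ?)", (target, serial_number, gaps))
    # The new row fills one gap for every later serial, as in the UPDATE above.
    conn.execute("""
        UPDATE assignments SET number_of_serial_gaps = number_of_serial_gaps - 1
        WHERE target = ? AND serial_number > ?
    """, (target, serial_number))

for sn in (1, 2, 5, 3):          # serial 5 arrives before 3 and 4
    insert_assignment(conn, "x", sn)
ready = [r[0] for r in conn.execute(
    "SELECT serial_number FROM assignments WHERE number_of_serial_gaps = 0 ORDER BY 1")]
print(ready)  # [1, 2, 3] -- serial 5 still has one gap (4) before it
```

Once serial 4 is inserted, the update would bring serial 5's counter to zero as well.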
The first approach arguably has a cleaner DB structure (just one boolean field), while the second has the advantage of requiring a simpler query.
Both ways would probably do, and I think the difference is not substantial; however, perhaps there are stronger arguments on either side?
CodePudding user response:
This seems like a question of "how do I avoid a race condition while externally processing rows?", combined with the requirements of not missing any rows and of fault tolerance in case of a failure or a hung process.
The answer to that question is more of a strategy than a specific how-to. I would suggest that you look into PostgreSQL advisory locks for the coordination. They would allow you to work on the rows, avoid conflicting with rows already being worked on, and remain fault tolerant if the process acting on the rows terminates abnormally.
CodePudding user response:
My recommendation would be not to keep the status in the table, but to check it in the query of the executing process.
Let's assume sample data as follows:
select * from tab
order by target,serial_number;
T SERIAL_NUMBER
- -------------
x 1
x 2
x 3
x 4
x 5
x 6
x 7
x 9
x 10
So serial 8 is missing.
An executor that wants to get a batch of 5 rows starting from 1 should succeed, but it fails to get a batch of 10 rows starting from 1.
The queries below do exactly that: they select the allowed range of serial numbers and check with a count subquery that all serials in the range are available.
-- get a batch of 5 rows starting from 1
with dt as (
select * from tab
where target = 'x'
and serial_number >= 1
and serial_number < 1 + 5)
select * from dt
where
5 = (select count(distinct serial_number) from dt)
order by 1,2;
T SERIAL_NUMBER
- -------------
x 1
x 2
x 3
x 4
x 5
-- get a batch of 10 rows starting from 1
with dt as (
select * from tab
where target = 'x'
and serial_number >= 1
and serial_number < 1 + 10)
select * from dt
where
10 = (select count(distinct serial_number) from dt)
order by 1,2;
no rows
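For completeness, the same count-check can be reproduced end to end with Python's stdlib sqlite3 (table and column names follow the sample data above; the get_batch helper and its parameters are my own naming):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (target TEXT, serial_number INTEGER)")
conn.executemany("INSERT INTO tab VALUES ('x', ?)",
                 [(n,) for n in (1, 2, 3, 4, 5, 6, 7, 9, 10)])  # serial 8 missing

# Parameterized form of the answer's query: select the candidate range,
# then return it only if every serial in the range is present.
BATCH_SQL = """
    WITH dt AS (
        SELECT * FROM tab
        WHERE target = ? AND serial_number >= ? AND serial_number < ? + ?
    )
    SELECT * FROM dt
    WHERE ? = (SELECT COUNT(DISTINCT serial_number) FROM dt)
    ORDER BY 1, 2
"""

def get_batch(conn, target, start, size):
    return conn.execute(BATCH_SQL, (target, start, start, size, size)).fetchall()

print(get_batch(conn, "x", 1, 5))   # full batch: serials 1..5
print(get_batch(conn, "x", 1, 10))  # [] -- serial 8 is missing, so no rows
```

The all-or-nothing behavior falls out of the count check: the executor either gets a complete, gap-free batch or nothing, with no status columns to maintain.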