In a data table, data is stored based on the production state and shipment of a product. This means product states are
product_created
packed
shipped
delivered
The packed and shipped states are inserted from a legacy system where these states can come late, even after the box has been delivered status. I want to take the last state of the product if the product journey follows the correct sequence (product_created -> packed -> shipped -> delivered
)
It should also support duplicate records, for example, product_created -> packed -> packed -> packed -> shipped -> shipped -> delivered
. In this case, the sequence is correct, so it should be taken into consideration.
input table
PRODUCT_ID STATE INSERTION_TIME
1 product_created 2023-01-10 07:00:00
1 product_created 2023-01-10 09:00:00
1 packed 2023-01-11 01:00:00
1 packed 2023-01-11 02:00:00
1 packed 2023-01-11 09:00:00
1 shipped 2023-01-12 01:00:00
1 delivered 2023-01-12 02:00:00
2 product_created 2023-01-10 07:00:44
2 packed 2023-01-11 01:00:00
2 delivered 2023-01-11 09:00:00
2 shipped 2023-01-12 02:00:00
3 product_created 2023-01-10 07:00:00
3 packed 2023-01-11 01:00:00
3 product_created 2023-01-11 09:00:00
3 packed 2023-01-11 09:00:00
3 shipped 2023-01-12 01:00:00
3 delivered 2023-01-12 02:00:00
Output
PRODUCT_ID STATE INSERTION_TIME
1 delivered 2023-01-12 02:00:00
Above result does not content PRODUCT_ID =2 and 3 because it contains incorrect sequence.
My query gets the final state of a product, but I don't know how to exclude products with incorrect sequences
SELECT * FROM datatable
QUALIFY ROW_NUMBER() OVER ( PARTITION BY PRODUCT_ID ORDER BY INSERTION_TIME DESC) = 1
This query only gives the last state of each product but it may contains some product which have incorrect sequence.
CodePudding user response:
SQL allows to detect patterns across multiple rows using
MATCH_RECOGNIZE db<>fiddle demo
CodePudding user response:
Assuming there can only be 4 distinct patterns for correct sequences, we can start by taking care of the duplicated states by keeping the most recent entry per product_id and state.
In the final select we can pick the last state per product_id on the condition that all its states when aggregated and ordered by insertion_time match one of the valid sequences.
with cte as
(select *
from dataTable
qualify row_number() over (partition by product_id, state order by insertion_time desc)=1)
select *
from cte
qualify row_number() over (partition by product_id order by insertion_time desc) = 1 and
listagg(state,',') within group (order by insertion_time asc) over (partition by product_id) in
('product_created,packed,shipped,delivered',
'product_created,packed,shipped',
'product_created,packed',
'product_created')