Home > OS >  Query database of events to find only events meeting parameters
Query database of events to find only events meeting parameters

Time:07-04

I have a dataset [Table_1] that records all events on a new row, meaning there are multiple entries for each customer_id. The structure is this;

customer_id recorded_at event_type value
123-456-789 2022-05-28 status open
123-456-789 2022-06-01 attribute order_placed
123-456-789 2022-06-02 attribute order_fulfilled
123-456-789 2022-06-04 status closed
123-456-789 2022-06-05 attribute order_placed
123-456-789 2022-06-07 attribute order_fulfilled
123-456-789 2022-06-10 status open
123-456-789 2022-06-11 attribute order_placed
123-456-789 2022-06-12 attribute order_fulfilled
123-456-789 2022-06-15 attribute order_placed
123-456-789 2022-06-17 attribute order_fulfilled
987-654-321 2022-06-12 status open
987-654-321 2022-06-15 attribute order_placed
987-654-321 2022-06-17 attribute order_fulfilled
987-654-321 2022-06-17 status closed

What I'm trying to do is write a query that returns the dates of the two attributes, order_placed and order_fulfilled after the last time the status went open. My approach is to query the dataset three times, first for all customers who went open, then returning the dates when the attributes are order_placed and order_cancelled, however I'm running into issues in returning all instances where the attributes are order_placed and order_fulfilled, not just the most recent one.

With d1 as (Select customer_id,recorded_at as open_time from Table_1 where event_type = 'status' and value = 'open')

Select d1.customer_id,
d1.open_time,
order_placed.order_placed_time,
order_fulfilled.order_filled_time

from d1

left join (Select customer_id,max(recorded_at) as order_placed_time from Table_1 where event_type = 'attribute' and value = 'order_placed') order_placed
on d1.customer_id = order_placed.customer_id and order_placed.order_placed_time > d1.open_time

left join (Select customer_id,max(recorded_at) as order_fulfilled_time from Table_1 where event_type = 'attribute' and value = 'order_fulfilled') order_filled
on d1.customer_id = order_filled.customer_id and order_filled.order_fulfilled_time > d1.open_time

where order_filled.order_fulfilled_time > order_placed.order_placed_time

However, this only returns the last time an order was placed and fulfilled after the status = open, not every instance where that happened. The output I am going for would look like:

customer_id open_time order_placed_time order_filled_time
123-456-789 2022-05-28 2022-06-01 2022-06-01
123-456-789 2022-06-10 2022-06-11 2022-06-12
123-456-789 2022-06-10 2022-06-15 2022-06-17
987-654-321 2022-06-12 2022-06-15 2022-06-17

CodePudding user response:

One option to solve this problem is following these steps:

  • keep all rows found between an open and an end, hence remove the end and the others
  • assign a unique id to different couples of ("order_placed","order_fulfilled")
  • extract the values relative to "open_time", "order_placed_time" and "order_fulfilled_time" with a CASE statement in three separate fields
  • apply different aggregations over "open_time" and "order_placed/fulfilled_time" separately, as long as each "open_time" can have multiple couples of orders.

These four steps are implemented in two ctes.


The first cte includes:

  • the first COUNT, that allows to extract even values for the open/order_placed/order_fulfilled (orders following open) values and odd values for the closed/order_placed/order_fulfilled values (orders following closed):
  • the second COUNT, that allows to extract different values for each couple made of ("order_placed", "order_fulfilled")
SELECT *, 
       COUNT(CASE WHEN value = 'open'   THEN 1 
                  WHEN value = 'closed' THEN 0 END) OVER (
           PARTITION BY customer_id
           ORDER     BY recorded_at, event_type
       ) AS status_value,
       COUNT(CASE WHEN value <> 'order_fulfilled' THEN 1 END) OVER(
           PARTITION BY customer_id 
           ORDER     BY recorded_at, event_type
       ) AS order_value 
FROM tab

The second cte includes:

  • a WHERE clause that filters out all rows that are found between a "closed" and an "open" value, first included, last excluded
  • the first MAX window function, that partitions on the customer and on the previous first COUNT function, to extract the "open_time" value
  • the second MAX window function, that partitions on the customer and on the previous second COUNT function, to extract the "order_placed_time" value
  • the third MAX window function, that partitions on the customer and on the previous second COUNT function, to extract the "order_fulfilled_time" value
SELECT customer_id,
       MAX(CASE WHEN value = 'open' THEN recorded_at END) OVER(
           PARTITION BY customer_id, status_value
       ) AS open_time,
       MAX(CASE WHEN value = 'order_placed' THEN recorded_at END) OVER(
           PARTITION BY customer_id, order_value
       ) AS order_placed_time,
       MAX(CASE WHEN value = 'order_fulfilled' THEN recorded_at END) OVER(
           PARTITION BY customer_id, order_value
       ) AS order_fulfilled_time
FROM cte 
WHERE MOD(status_value, 2) = 1

Note that it is not possible to use the MAX aggregation functions with a unique GROUP BY clause because the first MAX and the other two MAX aggregate on different columns respectively.


The final query uses the ctes and adds up:

  • a selection of DISTINCT rows (we're aggregating the output of the window functions)
  • a filtering operation on rows with NULL values in either the "order_placed_time" or "order_fulfilled_time" (correspond to the old "open" rows).
WITH cte AS (
    SELECT *, 
           COUNT(CASE WHEN value = 'open'   THEN 1 
                      WHEN value = 'closed' THEN 0 END) OVER (
               PARTITION BY customer_id
               ORDER     BY recorded_at, event_type
           ) AS status_value,
           COUNT(CASE WHEN value <> 'order_fulfilled' THEN 1 END) OVER(
               PARTITION BY customer_id 
               ORDER     BY recorded_at, event_type
           ) AS order_value 
    FROM tab
), cte2 AS(
    SELECT customer_id,
           MAX(CASE WHEN value = 'open' THEN recorded_at END) OVER(
               PARTITION BY customer_id, status_value
           ) AS open_time,
           MAX(CASE WHEN value = 'order_placed' THEN recorded_at END) OVER(
               PARTITION BY customer_id, order_value
           ) AS order_placed_time,
           MAX(CASE WHEN value = 'order_fulfilled' THEN recorded_at END) OVER(
               PARTITION BY customer_id, order_value
           ) AS order_fulfilled_time
    FROM cte 
    WHERE MOD(status_value, 2) = 1
)
SELECT DISTINCT * 
FROM cte2
WHERE order_fulfilled_time IS NOT NULL

I'd recommend to check intermediate output steps for a deep understanding of this specific solution.

CodePudding user response:

What I'm trying to do is write a query that returns the dates of the two attributes, order_placed and order_fulfilled after the last time the status went open.

Consider below query:

WITH orders AS (
  SELECT *, SUM(IF(value IN ('open', 'closed'), 1, 0)) OVER w AS order_group 
    FROM sample
  WINDOW w AS (PARTITION BY customer_id ORDER BY recorded_at, event_type)
)
SELECT customer_id, open_time, pre_recorded_at AS order_placed_time, recorded_at AS order_filled_time 
  FROM (
    SELECT *, FIRST_VALUE(IF(value = 'open', recorded_at, NULL)) OVER w AS open_time,
           LAG(recorded_at) OVER w AS pre_recorded_at,
      FROM orders
    WINDOW w AS (PARTITION BY customer_id, order_group ORDER BY recorded_at)
  )
 WHERE open_time IS NOT NULL AND value = 'order_fulfilled'
;

output will be:

enter image description here

Note: Due to transactions below in your dataset, orders CTE has a weired event_type column in ORDER BY clause. If you have more accurate timestamp recorded_at, it can be removed. I'll leave it to you.

  WINDOW w AS (PARTITION BY customer_id ORDER BY recorded_at, event_type)
987-654-321 2022-06-17  attribute   order_fulfilled
987-654-321 2022-06-17  status  closed

CodePudding user response:

with data as (
    select *, sum(case when value = 'open' then 1 end) over (partition by customer_id) as grp
    from T
)
select customer_id,
    min(case when value = 'open' then recorded_at end) as open_time,
    ...
from data
group by customer_id, grp

CodePudding user response:

Consider yet another option

with order_groups as (
  select *, 
    countif(value in ('open', 'closed')) over order_group_sorted as group_num,
    countif(value = 'order_placed') over order_group_sorted as subgroup_num,
  from your_table
  window order_group_sorted as (partition by customer_id order by recorded_at, event_type)
)
select * except(subgroup_num) from (
  select customer_id, recorded_at, value, subgroup_num, 
    max(if(value = 'open', recorded_at, null)) over order_group as open_time
  from order_groups
  window order_group as (partition by customer_id, group_num)
)
pivot (any_value(recorded_at) for value in ('order_placed', 'order_fulfilled'))
where not open_time || order_placed is null            

if applied to sample data in your question - output is

enter image description here

  • Related