I have a dataset [Table_1] that records all events on a new row, meaning there are multiple entries for each customer_id. The structure is this;
customer_id | recorded_at | event_type | value |
---|---|---|---|
123-456-789 | 2022-05-28 | status | open |
123-456-789 | 2022-06-01 | attribute | order_placed |
123-456-789 | 2022-06-02 | attribute | order_fulfilled |
123-456-789 | 2022-06-04 | status | closed |
123-456-789 | 2022-06-05 | attribute | order_placed |
123-456-789 | 2022-06-07 | attribute | order_fulfilled |
123-456-789 | 2022-06-10 | status | open |
123-456-789 | 2022-06-11 | attribute | order_placed |
123-456-789 | 2022-06-12 | attribute | order_fulfilled |
123-456-789 | 2022-06-15 | attribute | order_placed |
123-456-789 | 2022-06-17 | attribute | order_fulfilled |
987-654-321 | 2022-06-12 | status | open |
987-654-321 | 2022-06-15 | attribute | order_placed |
987-654-321 | 2022-06-17 | attribute | order_fulfilled |
987-654-321 | 2022-06-17 | status | closed |
What I'm trying to do is write a query that returns the dates of the two attributes, order_placed and order_fulfilled after the last time the status went open. My approach is to query the dataset three times, first for all customers who went open, then returning the dates when the attributes are order_placed and order_cancelled, however I'm running into issues in returning all instances where the attributes are order_placed and order_fulfilled, not just the most recent one.
With d1 as (Select customer_id,recorded_at as open_time from Table_1 where event_type = 'status' and value = 'open')
Select d1.customer_id,
d1.open_time,
order_placed.order_placed_time,
order_fulfilled.order_filled_time
from d1
left join (Select customer_id,max(recorded_at) as order_placed_time from Table_1 where event_type = 'attribute' and value = 'order_placed') order_placed
on d1.customer_id = order_placed.customer_id and order_placed.order_placed_time > d1.open_time
left join (Select customer_id,max(recorded_at) as order_fulfilled_time from Table_1 where event_type = 'attribute' and value = 'order_fulfilled') order_filled
on d1.customer_id = order_filled.customer_id and order_filled.order_fulfilled_time > d1.open_time
where order_filled.order_fulfilled_time > order_placed.order_placed_time
However, this only returns the last time an order was placed and fulfilled after the status = open, not every instance where that happened. The output I am going for would look like:
customer_id | open_time | order_placed_time | order_filled_time |
---|---|---|---|
123-456-789 | 2022-05-28 | 2022-06-01 | 2022-06-01 |
123-456-789 | 2022-06-10 | 2022-06-11 | 2022-06-12 |
123-456-789 | 2022-06-10 | 2022-06-15 | 2022-06-17 |
987-654-321 | 2022-06-12 | 2022-06-15 | 2022-06-17 |
CodePudding user response:
One option to solve this problem is following these steps:
- keep all rows found between an open and an end, hence remove the end and the others
- assign a unique id to different couples of ("order_placed","order_fulfilled")
- extract the values relative to "open_time", "order_placed_time" and "order_fulfilled_time" with a
CASE
statement in three separate fields - apply different aggregations over "open_time" and "order_placed/fulfilled_time" separately, as long as each "open_time" can have multiple couples of orders.
These four steps are implemented in two ctes.
The first cte includes:
- the first
COUNT
, that allows to extract even values for the open/order_placed/order_fulfilled (orders following open) values and odd values for the closed/order_placed/order_fulfilled values (orders following closed): - the second
COUNT
, that allows to extract different values for each couple made of ("order_placed", "order_fulfilled")
SELECT *,
COUNT(CASE WHEN value = 'open' THEN 1
WHEN value = 'closed' THEN 0 END) OVER (
PARTITION BY customer_id
ORDER BY recorded_at, event_type
) AS status_value,
COUNT(CASE WHEN value <> 'order_fulfilled' THEN 1 END) OVER(
PARTITION BY customer_id
ORDER BY recorded_at, event_type
) AS order_value
FROM tab
The second cte includes:
- a
WHERE
clause that filters out all rows that are found between a "closed" and an "open" value, first included, last excluded - the first
MAX
window function, that partitions on the customer and on the previous firstCOUNT
function, to extract the "open_time" value - the second
MAX
window function, that partitions on the customer and on the previous secondCOUNT
function, to extract the "order_placed_time" value - the third
MAX
window function, that partitions on the customer and on the previous secondCOUNT
function, to extract the "order_fulfilled_time" value
SELECT customer_id,
MAX(CASE WHEN value = 'open' THEN recorded_at END) OVER(
PARTITION BY customer_id, status_value
) AS open_time,
MAX(CASE WHEN value = 'order_placed' THEN recorded_at END) OVER(
PARTITION BY customer_id, order_value
) AS order_placed_time,
MAX(CASE WHEN value = 'order_fulfilled' THEN recorded_at END) OVER(
PARTITION BY customer_id, order_value
) AS order_fulfilled_time
FROM cte
WHERE MOD(status_value, 2) = 1
Note that it is not possible to use the MAX
aggregation functions with a unique GROUP BY
clause because the first MAX
and the other two MAX
aggregate on different columns respectively.
The final query uses the ctes and adds up:
- a selection of
DISTINCT
rows (we're aggregating the output of the window functions) - a filtering operation on rows with NULL values in either the "order_placed_time" or "order_fulfilled_time" (correspond to the old "open" rows).
WITH cte AS (
SELECT *,
COUNT(CASE WHEN value = 'open' THEN 1
WHEN value = 'closed' THEN 0 END) OVER (
PARTITION BY customer_id
ORDER BY recorded_at, event_type
) AS status_value,
COUNT(CASE WHEN value <> 'order_fulfilled' THEN 1 END) OVER(
PARTITION BY customer_id
ORDER BY recorded_at, event_type
) AS order_value
FROM tab
), cte2 AS(
SELECT customer_id,
MAX(CASE WHEN value = 'open' THEN recorded_at END) OVER(
PARTITION BY customer_id, status_value
) AS open_time,
MAX(CASE WHEN value = 'order_placed' THEN recorded_at END) OVER(
PARTITION BY customer_id, order_value
) AS order_placed_time,
MAX(CASE WHEN value = 'order_fulfilled' THEN recorded_at END) OVER(
PARTITION BY customer_id, order_value
) AS order_fulfilled_time
FROM cte
WHERE MOD(status_value, 2) = 1
)
SELECT DISTINCT *
FROM cte2
WHERE order_fulfilled_time IS NOT NULL
I'd recommend to check intermediate output steps for a deep understanding of this specific solution.
CodePudding user response:
What I'm trying to do is write a query that returns the dates of the two attributes, order_placed and order_fulfilled after the last time the status went open.
Consider below query:
WITH orders AS (
SELECT *, SUM(IF(value IN ('open', 'closed'), 1, 0)) OVER w AS order_group
FROM sample
WINDOW w AS (PARTITION BY customer_id ORDER BY recorded_at, event_type)
)
SELECT customer_id, open_time, pre_recorded_at AS order_placed_time, recorded_at AS order_filled_time
FROM (
SELECT *, FIRST_VALUE(IF(value = 'open', recorded_at, NULL)) OVER w AS open_time,
LAG(recorded_at) OVER w AS pre_recorded_at,
FROM orders
WINDOW w AS (PARTITION BY customer_id, order_group ORDER BY recorded_at)
)
WHERE open_time IS NOT NULL AND value = 'order_fulfilled'
;
output will be:
Note: Due to transactions below in your dataset, orders CTE has a weired event_type column in ORDER BY clause. If you have more accurate timestamp recorded_at, it can be removed. I'll leave it to you.
WINDOW w AS (PARTITION BY customer_id ORDER BY recorded_at, event_type)
987-654-321 2022-06-17 attribute order_fulfilled
987-654-321 2022-06-17 status closed
CodePudding user response:
with data as (
select *, sum(case when value = 'open' then 1 end) over (partition by customer_id) as grp
from T
)
select customer_id,
min(case when value = 'open' then recorded_at end) as open_time,
...
from data
group by customer_id, grp
CodePudding user response:
Consider yet another option
with order_groups as (
select *,
countif(value in ('open', 'closed')) over order_group_sorted as group_num,
countif(value = 'order_placed') over order_group_sorted as subgroup_num,
from your_table
window order_group_sorted as (partition by customer_id order by recorded_at, event_type)
)
select * except(subgroup_num) from (
select customer_id, recorded_at, value, subgroup_num,
max(if(value = 'open', recorded_at, null)) over order_group as open_time
from order_groups
window order_group as (partition by customer_id, group_num)
)
pivot (any_value(recorded_at) for value in ('order_placed', 'order_fulfilled'))
where not open_time || order_placed is null
if applied to sample data in your question - output is