I have the one column in the table of AWS Athena with structure as follows.
Order_id Item_Count Path_List order_date
1 1 A, B, C, A, W 2022-08-23
2 3 C, A, D, Z 2022-08-21
Path -> type array of Strings.
The first row indicates that order_id 1 had 1 item which was purchased after the item passed path A->B->C->A->W.
Similarly, 2nd row indicates that order_id 2 had 3 items which were purchased after the items passed path C->A->D->Z
Now I need to write a SQL query which gives me the list of all the paths and their total contribution in the orders for a given time-range.
So to print distinct path items, I have written the below query.
select path_item from table_name cross join unnest(path_list) as t(path_item)
where date(order_date) <= current_date and date(order_date) >= current_date - interval '6' day group by 1
So I get all the individual path stages for all the path_list. The output is as follows:
A
B
C
W
D
Z
Now I want to find how much stage A or C or D contributed to the overall products purchased.
To find contribution of A, it should be (Product which followed A)/(Total path items) = 5/17
total path items = 15 34 = 17 Items having A path = 2 3*1 = 5
Similarly, for B and W, it will be 1/17
For C, 4/17
for D, 3/17
For Z, 3/17
Can you suggest the SQL to print the below output
A 5/17
B 1/17
C 4/17
W 1/17
D 3/17
Z 3/17
CodePudding user response:
You can use group by
to process the unnested data (notice that I use succinct style allowing to skip cross join
for unnest
) and use either windows functions or subquery to count total number of elements for the divisor. With window function:
-- sample data
with dataset (Order_id, Item_Count, Path_List, order_date) as (
values (1, 1, array['A', 'B', 'C', 'A', 'W'], '2022-08-23'),
(2, 3, array['C', 'A', 'D', 'Z' ], '2022-08-21')
)
-- query
select edge,
cast(edge_cnt as varchar)
|| '/'
|| cast(sum(edge_cnt) over (range between unbounded preceding and UNBOUNDED FOLLOWING) as varchar)
from (select sum(Item_Count) edge_cnt, edge
from dataset,
unnest(Path_List) as t(edge)
group by edge)
order by edge;
With subquery:
-- query
select edge,
cast(edge_cnt as varchar)
|| '/'
|| cast((select sum(Item_Count * cardinality(Path_List)) from dataset) as varchar)
from (select sum(Item_Count) edge_cnt, edge
from dataset,
unnest(Path_List) as t(edge)
group by edge)
order by edge;
Output:
edge | _col1 |
---|---|
A | 5/17 |
B | 1/17 |
C | 4/17 |
D | 3/17 |
W | 1/17 |
Z | 3/17 |