How to produce the following output in SQL-CodePudding

I have the one column in the table of AWS Athena with structure as follows.

Order_id     Item_Count   Path_List                 order_date
1            1            A, B, C, A, W             2022-08-23
2            3            C, A, D, Z                2022-08-21

Path -> type array of Strings.

The first row indicates that order_id 1 had 1 item which was purchased after the item passed path A->B->C->A->W.

Similarly, 2nd row indicates that order_id 2 had 3 items which were purchased after the items passed path C->A->D->Z

Now I need to write a SQL query which gives me the list of all the paths and their total contribution in the orders for a given time-range.

So to print distinct path items, I have written the below query.

select path_item from table_name cross join unnest(path_list) as t(path_item)
where date(order_date) <= current_date and date(order_date) >= current_date - interval '6' day group by 1

So I get all the individual path stages for all the path_list. The output is as follows:

A
B
C
W
D
Z

Now I want to find how much stage A or C or D contributed to the overall products purchased.

To find contribution of A, it should be (Product which followed A)/(Total path items) = 5/17

total path items = 15 34 = 17 Items having A path = 2 3*1 = 5

Similarly, for B and W, it will be 1/17

For C, 4/17

for D, 3/17

For Z, 3/17

Can you suggest the SQL to print the below output

CodePudding user response：

You can use group by to process the unnested data (notice that I use succinct style allowing to skip cross join for unnest) and use either windows functions or subquery to count total number of elements for the divisor. With window function:

-- sample data
with dataset (Order_id, Item_Count, Path_List, order_date) as (
    values (1, 1, array['A', 'B', 'C', 'A', 'W'], '2022-08-23'),
        (2, 3, array['C', 'A', 'D', 'Z'     ], '2022-08-21')
)

-- query
select edge,
       cast(edge_cnt as varchar)
           || '/'
           || cast(sum(edge_cnt) over (range between unbounded preceding and UNBOUNDED FOLLOWING) as varchar)
from (select sum(Item_Count) edge_cnt, edge
      from dataset,
           unnest(Path_List) as t(edge)
      group by edge)
order by edge;

With subquery:

-- query
select edge,
       cast(edge_cnt as varchar)
           || '/'
           || cast((select sum(Item_Count * cardinality(Path_List)) from dataset) as varchar)
from (select sum(Item_Count) edge_cnt, edge
      from dataset,
           unnest(Path_List) as t(edge)
      group by edge)
order by edge;

Output:

edge	_col1
A	5/17
B	1/17
C	4/17
D	3/17
W	1/17
Z	3/17