I have a table contains info about customers and their purchases amount of each type of food. I want to create new columns that is the most freq type of food they have purchased. Is there an efficient way to do this?
I tried using case when and do one-to-one comparison, but it got very tedious.
Sample data:
Cust_ID | apple_type1 | apple_type2 | apple_type3 | apple_type4 | apple_type5 | apple_type6 |
---|---|---|---|---|---|---|
1 | 2 | 0 | 0 | 3 | 6 | 1 |
2 | 0 | 0 | 0 | 1 | 0 | 1 |
3 | 4 | 2 | 1 | 1 | 0 | 1 |
4 | 5 | 5 | 5 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
--WANT
Cust_ID | freq_apple_type_buy |
---|---|
1 | type5 |
2 | type4 and type6 |
3 | type1 |
4 | type1 and type2 and type3 |
5 | unknown |
CodePudding user response:
Consider below approach
select Cust_ID, if(count(1) = any_value(all_count), 'unknown', string_agg(type, ' and ')) freq_apple_type_buy
from (
select *, count(1) over(partition by Cust_ID) all_count
from (
select Cust_ID, replace(arr[offset(0)], 'apple_', '') type,cast(arr[offset(1)] as int64) value
from data t,
unnest(split(translate(to_json_string((select as struct * except(Cust_ID) from unnest([t]))), '{}"', ''))) kv,
unnest([struct(split(kv, ':') as arr)])
)
where true qualify 1 = rank() over(partition by Cust_ID order by value desc)
)
group by Cust_ID
if applied to sample data in your question - output is
CodePudding user response:
This uses UNPIVOT to turn your columns in to rows. Then uses RANK() to assign each row a rank, which means if multiple rows are matched in quantity, they share the same rank.
It then selects only the products with rank=1 (possibly multiple rows, if multiple products are tied for first place)
WITH
normalised_and_ranked AS
(
SELECT
cust_id,
product,
qty,
RANK() OVER (PARTITION BY cust_id ORDER BY qty DESC) AS product_rank,
ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY qty DESC) AS product_row
FROM
yourData
UNPIVOT(
qty FOR product IN (apple_type1, apple_type2, apple_type3, apple_type4, apple_type5, apple_type6)
)
)
SELECT
cust_id,
CASE WHEN qty = 0 THEN NULL ELSE product END AS product,
CASE WHEN qty = 0 THEN NULL ELSE qty END AS qty
FROM
normalised_and_ranked
WHERE
(product_rank = 1 AND qty > 0)
OR
(product_row = 1)
Edit: fudge added to ensure row of nulls returned if all qty are 0.
(Normally I'd just not return a row for such customers.)