I have recently run a query on SQL that brings me the most common combinations of products on a basket. Here's how the query looks like:
WITH cte AS (
SELECT a.order_id, a.SKU_number as product_1, b.SKU_number as product_2, c.SKU_number as product_3, d.SKU_number as product_4
FROM [cons_customer].[sales_detail_time] a
JOIN [cons_customer].[sales_detail_time] b
ON a.order_id = b.order_id AND a.SKU_number <> b.SKU_number
JOIN [cons_customer].[sales_detail_time] c
ON a.order_id = c.order_id AND a.SKU_number <> c.SKU_number AND b.SKU_number <> c.SKU_number
JOIN [cons_customer].[sales_detail_time] d
ON a.order_id = d.order_id AND a.SKU_number <> d.SKU_number AND b.SKU_number <> d.SKU_number AND c.SKU_number <> d.SKU_number
WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP 50 product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC;
However, there's one tiny problem with the results. I'm getting duplicated combinations, as the same products swap around the product_2, product_3 and product_4 columns. Here's an example:
- I have one combination of 3 products: X, Y and Z.
- The query I'm running is showing me three lines where:
product_2 | product_3 | product_4 | count |
---|---|---|---|
X | Y | Z | 18 |
Y | Z | X | 18 |
Z | X | Y | 18 |
As you can see, there is no duplicates along the columns, but these three lines are basically the same combination, but sorted on a different order. Any way of de-duplicating these values?
CodePudding user response:
Use <
in place of <>
in the JOIN
conditions.
WITH cte AS (
SELECT a.order_id,
a.SKU_number as product_1,
b.SKU_number as product_2,
c.SKU_number as product_3,
d.SKU_number as product_4
FROM [cons_customer].[sales_detail_time] a
JOIN [cons_customer].[sales_detail_time] b
ON a.order_id = b.order_id AND a.SKU_number < b.SKU_number
JOIN [cons_customer].[sales_detail_time] c
ON a.order_id = c.order_id AND a.SKU_number < c.SKU_number
AND b.SKU_number < c.SKU_number
JOIN [cons_customer].[sales_detail_time] d
ON a.order_id = d.order_id AND a.SKU_number < d.SKU_number
AND b.SKU_number < d.SKU_number
AND c.SKU_number < d.SKU_number
WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP(50) product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC;
Given that you enforce a < b < c < d, you can try removing some conditions too.
WITH cte AS (
SELECT a.order_id,
a.SKU_number as product_1,
b.SKU_number as product_2,
c.SKU_number as product_3,
d.SKU_number as product_4
FROM [cons_customer].[sales_detail_time] a
JOIN [cons_customer].[sales_detail_time] b
ON a.order_id = b.order_id AND a.SKU_number < b.SKU_number
JOIN [cons_customer].[sales_detail_time] c
ON a.order_id = c.order_id AND b.SKU_number < c.SKU_number
JOIN [cons_customer].[sales_detail_time] d
ON a.order_id = d.order_id AND c.SKU_number < d.SKU_number
WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP(50) product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC;
CodePudding user response:
Huge thanks to @lemon on this. His suggestion was totally right, but the details of my table were a bit more complex than I fought. As my table has one entry for every product purchased, there were several purchases with more than 3 items that were not being identified by the query, once the query was looking only for the most combination of the 3 most purchased products. Therefore I had to adjust a little bit - with an extra challenge. There is also entries for items where revenue = 0, such as packaging or trial sachets.
This is what my final query looks like. Basically, I've asked SQL to bring me all the transactions with exactly 3 items where Revenue > 0 and also one extra fourth item which was the product I wanted to explore.
WITH cte AS (
SELECT order_id
FROM [cons_customer].[sales_detail_time]
WHERE sku_number = 'PBPR108BAU.H01' OR [revenue_tax_inc_AUD]
> 0
GROUP BY order_id
HAVING SUM(CASE WHEN sku_number = 'PBPR108BAU.H01' THEN 1 ELSE 0 END) > 0
AND COUNT(DISTINCT CASE WHEN sku_number <> 'PBPR108BAU.H01' AND [revenue_tax_inc_AUD]
> 0 THEN sku_number END) = 3
)
SELECT 'PBPR108BAU.H01' AS product_01,
t1.sku_number AS product_02,
t2.sku_number AS product_03,
t3.sku_number AS product_04,
COUNT(DISTINCT t1.order_id) AS count
FROM [cons_customer].[sales_detail_time] t1
JOIN [cons_customer].[sales_detail_time] t2
ON t1.order_id = t2.order_id AND t1.sku_number < t2.sku_number
JOIN [cons_customer].[sales_detail_time] t3
ON t2.order_id = t3.order_id AND t2.sku_number < t3.sku_number
AND t1.sku_number <> t3.sku_number
WHERE t1.order_id IN (SELECT order_id FROM cte)
AND t1.sku_number < 'PBPR108BAU.H01'
AND t2.sku_number < 'PBPR108BAU.H01'
AND t3.sku_number < 'PBPR108BAU.H01'
GROUP BY t1.sku_number, t2.sku_number, t3.sku_number
ORDER BY count DESC