Home > Blockchain >  De-duplicating combinations
De-duplicating combinations

Time:01-24

I have recently run a query on SQL that brings me the most common combinations of products on a basket. Here's how the query looks like:

WITH cte AS (
SELECT a.order_id, a.SKU_number as product_1, b.SKU_number as product_2, c.SKU_number as product_3, d.SKU_number as product_4
FROM [cons_customer].[sales_detail_time] a
JOIN [cons_customer].[sales_detail_time] b
ON a.order_id = b.order_id AND a.SKU_number <> b.SKU_number
JOIN [cons_customer].[sales_detail_time] c
ON a.order_id = c.order_id AND a.SKU_number <> c.SKU_number AND b.SKU_number <> c.SKU_number
JOIN [cons_customer].[sales_detail_time] d
ON a.order_id = d.order_id AND a.SKU_number <> d.SKU_number AND b.SKU_number <> d.SKU_number AND c.SKU_number <> d.SKU_number
WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP 50 product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC; 

However, there's one tiny problem with the results. I'm getting duplicated combinations, as the same products swap around the product_2, product_3 and product_4 columns. Here's an example:

  • I have one combination of 3 products: X, Y and Z.
  • The query I'm running is showing me three lines where:
product_2 product_3 product_4 count
X Y Z 18
Y Z X 18
Z X Y 18

As you can see, there is no duplicates along the columns, but these three lines are basically the same combination, but sorted on a different order. Any way of de-duplicating these values?

CodePudding user response:

Use < in place of <> in the JOIN conditions.

WITH cte AS (
    SELECT a.order_id, 
           a.SKU_number as product_1, 
           b.SKU_number as product_2, 
           c.SKU_number as product_3, 
           d.SKU_number as product_4
    FROM [cons_customer].[sales_detail_time] a
    JOIN [cons_customer].[sales_detail_time] b
      ON a.order_id = b.order_id AND a.SKU_number < b.SKU_number
    JOIN [cons_customer].[sales_detail_time] c
      ON a.order_id = c.order_id AND a.SKU_number < c.SKU_number 
                                 AND b.SKU_number < c.SKU_number
    JOIN [cons_customer].[sales_detail_time] d
      ON a.order_id = d.order_id AND a.SKU_number < d.SKU_number 
                                 AND b.SKU_number < d.SKU_number 
                                 AND c.SKU_number < d.SKU_number
    WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP(50) product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC; 

Given that you enforce a < b < c < d, you can try removing some conditions too.

WITH cte AS (
    SELECT a.order_id, 
           a.SKU_number as product_1, 
           b.SKU_number as product_2, 
           c.SKU_number as product_3, 
           d.SKU_number as product_4
    FROM [cons_customer].[sales_detail_time] a
    JOIN [cons_customer].[sales_detail_time] b
      ON a.order_id = b.order_id AND a.SKU_number < b.SKU_number
    JOIN [cons_customer].[sales_detail_time] c
      ON a.order_id = c.order_id AND b.SKU_number < c.SKU_number
    JOIN [cons_customer].[sales_detail_time] d
      ON a.order_id = d.order_id AND c.SKU_number < d.SKU_number
    WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP(50) product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC; 

CodePudding user response:

Huge thanks to @lemon on this. His suggestion was totally right, but the details of my table were a bit more complex than I fought. As my table has one entry for every product purchased, there were several purchases with more than 3 items that were not being identified by the query, once the query was looking only for the most combination of the 3 most purchased products. Therefore I had to adjust a little bit - with an extra challenge. There is also entries for items where revenue = 0, such as packaging or trial sachets.

This is what my final query looks like. Basically, I've asked SQL to bring me all the transactions with exactly 3 items where Revenue > 0 and also one extra fourth item which was the product I wanted to explore.

WITH cte AS (
SELECT order_id
FROM [cons_customer].[sales_detail_time]
WHERE sku_number = 'PBPR108BAU.H01' OR [revenue_tax_inc_AUD]
 > 0
GROUP BY order_id
HAVING SUM(CASE WHEN sku_number = 'PBPR108BAU.H01' THEN 1 ELSE 0 END) > 0
AND COUNT(DISTINCT CASE WHEN sku_number <> 'PBPR108BAU.H01' AND [revenue_tax_inc_AUD]
 > 0 THEN sku_number END) = 3
)
SELECT 'PBPR108BAU.H01' AS product_01,
t1.sku_number AS product_02,
t2.sku_number AS product_03,
t3.sku_number AS product_04,
COUNT(DISTINCT t1.order_id) AS count
FROM [cons_customer].[sales_detail_time] t1
JOIN [cons_customer].[sales_detail_time] t2
ON t1.order_id = t2.order_id AND t1.sku_number < t2.sku_number
JOIN [cons_customer].[sales_detail_time] t3
ON t2.order_id = t3.order_id AND t2.sku_number < t3.sku_number
AND t1.sku_number <> t3.sku_number
WHERE t1.order_id IN (SELECT order_id FROM cte)
AND t1.sku_number < 'PBPR108BAU.H01'
AND t2.sku_number < 'PBPR108BAU.H01'
AND t3.sku_number < 'PBPR108BAU.H01'
GROUP BY t1.sku_number, t2.sku_number, t3.sku_number
ORDER BY count DESC
  • Related