I want to make two lists.
- Customers who have bought product 'a' OR product 'b'
- Customers who have bought product 'a' AND product 'b'
Number 1 is easy
SELECT SalesTable.OrderAccount
FROM SalesTable
WHERE SalesTable.ItemID = 'a' OR SalesTable.ItemId = 'b'
How do I solve number 2?
Thank you
CodePudding user response:
There are a few ways to do this.
One way is to select where SalesTable.ItemID IN ('a', 'b')
. Then you GROUP BY the customer and select rows HAVING two records. You have to be careful here to make sure you're also limiting to one order of each type. This method can be good because, done right, it limits you to one check through your table, but it can be tricky avoiding situations where a single customer may have ordered the same product more than once. It look something like this:
With T As (
SELECT DISTINCT OrderAccount, ItemID
FROM SalesTable
WHERE ItemID IN ('a', 'b')
)
SELECT OrderAccount
FROM T
GROUP BY OrderAccount
HAVING COUNT(ItemID) == 2
Another way is to JOIN SalesTable to itself using a different alias name for each instance of the table, where the join conditions restrict each instance of the table to a different one of the desired products and the same customer. This is more reliable about multiple orders for the same product, but it has to look through the table twice and multiply items in the result set when there are multiple matches on both sides of the JOIN.
SELECT DISTNCT s1.OrderAccount
FROM SalesTable s1
INNER JOIN SalesTable s2 ON s1.OrderAccount = s2.OrderAccount AND s2.ItemID = 'b'
WHERE s1.ItemID = '1'
Another option is using the row_number()
windowing function to partition by your products and look for the second row. This fits somewhere in between the two: it only goes through the full table once, but must then review the (somewhat smaller) initial results to get to the final answer. However, the query optimizer can often make this perform just as well as the first option.
SELECT OrderAccount
FROM (
SELECT OrderAccount, row_number()
OVER (partition by OrderAccount, ItemID ORDER BY OrderAccount, ItemID) rn
FROM SalesTable
WHERE ItemID IN ('a', 'b')
) T
WHERE rn = 2
CodePudding user response:
The first query can be written as
SELECT DISTINCT orderaccount
FROM salestable
WHERE itemid IN ('a', 'b');
The DISTINCT
is necessary to get each orderaccount
only once, no matter if they ordered only item a or only item b or both and whether they ordered them once or many times.
If you want to get order customers that bought both products, then use the same query, but group by customer and count distinct products:
SELECT orderaccount
FROM salestable
WHERE itemid IN ('a', 'b')
GROUP BY orderaccount
HAVING COUNT(DISTINCT itemid) = 2;
As we GROUP BY orderaccount
, we don't need to SELECT DISTINCT
anymore, because with the GROUP BY
clause we already aggregate per orderaccount
and get each just once.
If you have an order account table, you can also use IN
or EXISTS
for the lookup. Thus the DBMS can stop reading once it found a matching purchase. This may not matter here (a customer probably won't buy an item again and again and again), but in other situations (imagine a store selling an item a million times and you merely want to know whether the store sold it at least once or not at all) it can be very beneficial:
SELECT orderaccount
FROM orderaccounts
WHERE orderaccount IN (SELECT orderaccount FROM salestable WHERE itemid = 'a')
AND orderaccount IN (SELECT orderaccount FROM salestable WHERE itemid = 'b');
The same applies to the first query of course, where you would have OR
instead of AND
. With a large intermediate data set, DISTINCT
can be a costly operation.