I am using SQL Server 2014 SP3.
I have the following hypothetical database structure. There are accounts, which can belong to multiple customers, represented by the following tables: Account <- Account_Customer -> Customer
The customers, in turn, can own multiple cars: Customer <- Customer_Car -> Car
In addition, the customers can own many pets: Customer <- Customer_Pet -> Pet
Now I am trying to come up with the most efficient query to answer the following question: Get a list of accounts where none of the account owners have a "Cat" and none of the account owners drive a "Dodge".
The script below sets up the tables and some sample data. Please note that in real life, these tables will have 10's of millions of records, so I am trying to come up with the most efficient way to answer this question. So far I was only able to do it by accessing the same tables multiple times.
Setup script:
USE tempdb;
-- Create tables
IF OBJECT_ID('Account') IS NOT NULL DROP TABLE Account;
CREATE TABLE Account (AccountId INT, AccountName VARCHAR(24))
IF OBJECT_ID('Customer') IS NOT NULL DROP TABLE Customer;
CREATE TABLE Customer (CustomerId INT, CustomerName VARCHAR(24))
IF OBJECT_ID('Pet') IS NOT NULL DROP TABLE Pet;
CREATE TABLE Pet (PetId INT, PetName VARCHAR(24))
IF OBJECT_ID('Car') IS NOT NULL DROP TABLE Car;
CREATE TABLE Car (CarId INT, CarName VARCHAR(24))
IF OBJECT_ID('Account_Customer') IS NOT NULL DROP TABLE Account_Customer;
CREATE TABLE Account_Customer (AccountId INT, CustomerId INT)
IF OBJECT_ID('Customer_Pet') IS NOT NULL DROP TABLE Customer_Pet;
CREATE TABLE Customer_Pet (CustomerId INT, PetId INT)
IF OBJECT_ID('Customer_Car') IS NOT NULL DROP TABLE Customer_Car;
CREATE TABLE Customer_Car (CustomerId INT, CarId INT)
-- Populate data
INSERT [dbo].[Account]([AccountId], [AccountName])
VALUES (1, 'Account1'), (2, 'Account2')
INSERT [dbo].[Customer]([CustomerId], [CustomerName])
VALUES (1, 'Customer1'), (2, 'Customer2'), (3, 'Customer3'), (4, 'Customer4')
INSERT [dbo].[Pet]([PetId], [PetName])
VALUES (1, 'Cat1'), (2, 'Cat2'), (3, 'Dog3'), (4, 'Dog4')
INSERT [dbo].[Car]([CarId], [CarName])
VALUES (1, 'Ford1'), (2, 'Ford2'), (3, 'Kia3'), (4, 'Dodge4')
INSERT [dbo].[Account_Customer] ([AccountId], [CustomerId])
VALUES (1,1), (1,2), (2, 2), (2,3), (2,4)
INSERT [dbo].[Customer_Pet] ([CustomerId], [PetId])
VALUES (2,3), (3,1), (3, 2), (4,3), (4,4)
INSERT [dbo].[Customer_Car] ([CustomerId], [CarId])
VALUES (1,2), (2,2), (3,1), (3, 2), (3, 4)
--SELECT * FROM [dbo].[Account] AS [A]
--SELECT * FROM [dbo].[Customer] AS [C]
--SELECT * FROM [dbo].[Pet] AS [P]
--SELECT * FROM [dbo].[Car] AS [C]
--SELECT * FROM [dbo].[Account_Customer] AS [AC]
--SELECT * FROM [dbo].[Customer_Pet] AS [CP]
--SELECT * FROM [dbo].[Customer_Car] AS [CC]
-- Bring all the data together to see what we have (denormalized)
SELECT [A].[AccountId], [A].[AccountName],
[C].[CustomerId], [C].[CustomerName],
[CP].[PetId], [P].[PetName],
[C2].[CarId], [C2].[CarName]
FROM [dbo].[Customer] AS [C]
JOIN [dbo].[Account_Customer] AS [AC] ON [AC].[CustomerId] = [C].[CustomerId]
JOIN [dbo].[Account] AS [A] ON [A].[AccountId] = [AC].[AccountId]
LEFT JOIN [dbo].[Customer_Pet] AS [CP] ON [CP].[CustomerId] = [C].[CustomerId]
LEFT JOIN [dbo].[Pet] AS [P] ON [P].[PetId] = [CP].[PetId]
LEFT JOIN [dbo].[Customer_Car] AS [CC] ON [CC].[CustomerId] = [C].[CustomerId]
LEFT JOIN [dbo].[Car] AS [C2] ON [C2].[CarId] = [CC].[CarId]
ORDER BY [A].[AccountId], [AC].[CustomerId]
And here is the query, which answers my question, but I suspect it's inefficient on a large number of records. Is there a better way?
-- This should only return Account1
SELECT DISTINCT
[A].[AccountId],
[A].[AccountName]
FROM [dbo].[Customer] AS [C]
JOIN [dbo].[Account_Customer] AS [AC] ON [AC].[CustomerId] = [C].[CustomerId]
JOIN [dbo].[Account] AS [A] ON [A].[AccountId] = [AC].[AccountId]
EXCEPT
SELECT -- get Accounts where owner has a "Cat" or drives a "Dodge"
[A].[AccountId],
[A].[AccountName]
FROM [dbo].[Customer] AS [C]
JOIN [dbo].[Account_Customer] AS [AC] ON [AC].[CustomerId] = [C].[CustomerId]
JOIN [dbo].[Account] AS [A] ON [A].[AccountId] = [AC].[AccountId]
WHERE
(
EXISTS (SELECT TOP (1) 1
FROM [dbo].[Customer] AS [C2]
JOIN [dbo].[Customer_Pet] AS [CP2] ON [CP2].[CustomerId] = [C2].[CustomerId]
JOIN [dbo].[Pet] AS [P2] ON [P2].[PetId] = [CP2].[PetId]
WHERE [C2].[CustomerId] = [C].[CustomerId] -- correlation
AND [P2].[PetName] LIKE 'Cat%'
)
OR
EXISTS (SELECT TOP (1) 1
FROM [dbo].[Customer] AS [C2]
JOIN [dbo].[Customer_Car] AS [CP2] ON [CP2].[CustomerId] = [C2].[CustomerId]
JOIN [dbo].[Car] AS [P2] ON [P2].[CarId] = [CP2].[CarId]
WHERE [C2].[CustomerId] = [C].[CustomerId] -- correlation
AND [P2].[CarName] LIKE 'Dodge%'
)
)
Sorry if this is obvious, but please observe that the query below will not work (because it answers slightly different question - return accounts where AT LEAST ONE OWNER does not have a "Cat" and does not drive a "Dodge":
-- Does not work:
SELECT DISTINCT
[A].[AccountId],
[A].[AccountName]
FROM [dbo].[Customer] AS [C]
JOIN [dbo].[Account_Customer] AS [AC] ON [AC].[CustomerId] = [C].[CustomerId]
JOIN [dbo].[Account] AS [A] ON [A].[AccountId] = [AC].[AccountId]
WHERE
(
NOT EXISTS (SELECT TOP (1) 1
FROM [dbo].[Customer] AS [C2]
JOIN [dbo].[Customer_Pet] AS [CP2] ON [CP2].[CustomerId] = [C2].[CustomerId]
JOIN [dbo].[Pet] AS [P2] ON [P2].[PetId] = [CP2].[PetId]
WHERE [C2].[CustomerId] = [C].[CustomerId] -- correlation
AND [P2].[PetName] LIKE 'Cat%'
)
AND
NOT EXISTS (SELECT TOP (1) 1
FROM [dbo].[Customer] AS [C2]
JOIN [dbo].[Customer_Car] AS [CP2] ON [CP2].[CustomerId] = [C2].[CustomerId]
JOIN [dbo].[Car] AS [P2] ON [P2].[CarId] = [CP2].[CarId]
WHERE [C2].[CustomerId] = [C].[CustomerId] -- correlation
AND [P2].[CarName] LIKE 'Dodge%'
)
)
CodePudding user response:
I must say, in a real database I would be very suspicious of all these Many:Many relationships. Can an Account be owned by multiple Customers, each of whom can own multiple Accounts? Equally can a Cat or a Pet have multiple owners?
Be that as it may: you can express your query like this:
- You want all Accounts...
- for which there do not exist Account_Customers...
- Where those Customers are in the set of Customers who own a Cat...
- ... or a Dodge
SELECT *
FROM Account a
WHERE NOT EXISTS (
SELECT ac.CustomerId
FROM Account_Customer ac
WHERE ac.AccountId = a.AccountId
INTERSECT
(
SELECT cp.CustomerId
FROM Customer_Pet cp
JOIN Pet p ON p.PetId = cp.PetId
WHERE p.PetName LIKE 'Cat%'
UNION ALL
SELECT cc.CustomerId
FROM Customer_Car cc
JOIN Car c ON c.CarId = cc.CarId
WHERE c.CarName LIKE 'Dodge%'
)
)
CodePudding user response:
It's too late for a more in-depth answer, so here's a quick and dirty one with a temp table.
Mind you it's not as bad as it looks, many times I've had simple queries on temp tables massively outperform large, interesting (from a mathematic point of view) queries.
Also, a question about performance is never simple to answer. Of special interest is the fact that you mention millions of rows and need for performance while your query uses a like operator on some text column. At least the % is in the end, so it's still SARGable. Will this column have an index? That will probably make a difference.
Here (done blind, hopefully no errors):
create table #forbidden
(
CustomerId int primary key
)
insert #forbidden(CustomerId)
select CustomerId from Customer C
where
exists(select 1 from Customer_Pet CP where CP.CustomerId=C.CustomerId and CP.[PetName] LIKE 'Cat%')
or exists(select 1 from Customer_Car CC where CC.CustomerId=C.CustomerId and CC.[CarName] LIKE 'Dodge%')
select * from Account A
where not exists
(
select 1
from Account_Customer AC
where
AC.CustomerId=A.CustomerId
and AC.CustomerId in (select CustomerId from #forbidden)
)