More efficient way to write this query - several M:M relationships-CodePudding

I am using SQL Server 2014 SP3.

I have the following hypothetical database structure. There are accounts, which can belong to multiple customers, represented by the following tables: Account <- Account_Customer -> Customer

The customers, in turn, can own multiple cars: Customer <- Customer_Car -> Car

In addition, the customers can own many pets: Customer <- Customer_Pet -> Pet

Now I am trying to come up with the most efficient query to answer the following question: Get a list of accounts where none of the account owners have a "Cat" and none of the account owners drive a "Dodge".

The script below sets up the tables and some sample data. Please note that in real life, these tables will have 10's of millions of records, so I am trying to come up with the most efficient way to answer this question. So far I was only able to do it by accessing the same tables multiple times.

Setup script:

USE tempdb;

-- Create tables
IF OBJECT_ID('Account') IS NOT NULL DROP TABLE Account;
CREATE TABLE Account (AccountId INT, AccountName VARCHAR(24))

IF OBJECT_ID('Customer') IS NOT NULL DROP TABLE Customer;
CREATE TABLE Customer (CustomerId INT, CustomerName VARCHAR(24))

IF OBJECT_ID('Pet') IS NOT NULL DROP TABLE Pet;
CREATE TABLE Pet (PetId INT, PetName VARCHAR(24))

IF OBJECT_ID('Car') IS NOT NULL DROP TABLE Car;
CREATE TABLE Car (CarId INT, CarName VARCHAR(24))

IF OBJECT_ID('Account_Customer') IS NOT NULL DROP TABLE Account_Customer;
CREATE TABLE Account_Customer (AccountId INT, CustomerId INT)

IF OBJECT_ID('Customer_Pet') IS NOT NULL DROP TABLE Customer_Pet;
CREATE TABLE Customer_Pet (CustomerId INT, PetId INT)

IF OBJECT_ID('Customer_Car') IS NOT NULL DROP TABLE Customer_Car;
CREATE TABLE Customer_Car (CustomerId INT, CarId INT)

-- Populate data
INSERT [dbo].[Account]([AccountId], [AccountName])
VALUES (1, 'Account1'), (2, 'Account2')

INSERT [dbo].[Customer]([CustomerId], [CustomerName])
VALUES (1, 'Customer1'), (2, 'Customer2'), (3, 'Customer3'),  (4, 'Customer4')

INSERT [dbo].[Pet]([PetId], [PetName])
VALUES (1, 'Cat1'), (2, 'Cat2'), (3, 'Dog3'),  (4, 'Dog4')

INSERT [dbo].[Car]([CarId], [CarName])
VALUES (1, 'Ford1'), (2, 'Ford2'), (3, 'Kia3'),  (4, 'Dodge4')

INSERT [dbo].[Account_Customer] ([AccountId], [CustomerId])
VALUES (1,1), (1,2), (2, 2), (2,3), (2,4)

INSERT [dbo].[Customer_Pet] ([CustomerId], [PetId])
VALUES (2,3), (3,1), (3, 2), (4,3), (4,4)

INSERT [dbo].[Customer_Car] ([CustomerId], [CarId])
VALUES (1,2), (2,2), (3,1), (3, 2), (3, 4)

--SELECT * FROM [dbo].[Account] AS [A]
--SELECT * FROM [dbo].[Customer] AS [C]
--SELECT * FROM [dbo].[Pet] AS [P]
--SELECT * FROM [dbo].[Car] AS [C]
--SELECT * FROM [dbo].[Account_Customer] AS [AC]
--SELECT * FROM [dbo].[Customer_Pet] AS [CP]
--SELECT * FROM [dbo].[Customer_Car] AS [CC]

-- Bring all the data together to see what we have (denormalized)
SELECT [A].[AccountId], [A].[AccountName],
       [C].[CustomerId], [C].[CustomerName],
       [CP].[PetId], [P].[PetName],
        [C2].[CarId], [C2].[CarName]
FROM [dbo].[Customer] AS [C]
JOIN [dbo].[Account_Customer] AS [AC] ON [AC].[CustomerId] = [C].[CustomerId]
JOIN [dbo].[Account] AS [A] ON [A].[AccountId] = [AC].[AccountId]
LEFT JOIN [dbo].[Customer_Pet] AS [CP] ON [CP].[CustomerId] = [C].[CustomerId]
LEFT JOIN [dbo].[Pet] AS [P] ON [P].[PetId] = [CP].[PetId]
LEFT JOIN [dbo].[Customer_Car] AS [CC] ON [CC].[CustomerId] = [C].[CustomerId]
LEFT JOIN [dbo].[Car] AS [C2] ON [C2].[CarId] = [CC].[CarId]
ORDER BY [A].[AccountId], [AC].[CustomerId]

And here is the query, which answers my question, but I suspect it's inefficient on a large number of records. Is there a better way?


-- This should only return Account1
SELECT DISTINCT
       [A].[AccountId],
       [A].[AccountName]
FROM [dbo].[Customer] AS [C]
JOIN [dbo].[Account_Customer] AS [AC] ON [AC].[CustomerId] = [C].[CustomerId]
JOIN [dbo].[Account] AS [A] ON [A].[AccountId] = [AC].[AccountId]
EXCEPT
SELECT -- get Accounts where owner has a "Cat" or drives a "Dodge" 
       [A].[AccountId],
       [A].[AccountName]
FROM [dbo].[Customer] AS [C]
JOIN [dbo].[Account_Customer] AS [AC] ON [AC].[CustomerId] = [C].[CustomerId]
JOIN [dbo].[Account] AS [A] ON [A].[AccountId] = [AC].[AccountId]
WHERE 
(
    EXISTS (SELECT TOP (1) 1  
            FROM [dbo].[Customer] AS [C2]
            JOIN [dbo].[Customer_Pet] AS [CP2] ON [CP2].[CustomerId] = [C2].[CustomerId]
            JOIN [dbo].[Pet] AS [P2] ON [P2].[PetId] = [CP2].[PetId]
            WHERE [C2].[CustomerId] = [C].[CustomerId] -- correlation
            AND [P2].[PetName] LIKE 'Cat%'
            )
    OR
    EXISTS (SELECT TOP (1) 1  
            FROM [dbo].[Customer] AS [C2]
            JOIN [dbo].[Customer_Car] AS [CP2] ON [CP2].[CustomerId] = [C2].[CustomerId]
            JOIN [dbo].[Car] AS [P2] ON [P2].[CarId] = [CP2].[CarId]
            WHERE [C2].[CustomerId] = [C].[CustomerId] -- correlation
            AND [P2].[CarName] LIKE 'Dodge%'
            )
)

Sorry if this is obvious, but please observe that the query below will not work (because it answers slightly different question - return accounts where AT LEAST ONE OWNER does not have a "Cat" and does not drive a "Dodge":

-- Does not work:
SELECT DISTINCT 
       [A].[AccountId],
       [A].[AccountName]
FROM [dbo].[Customer] AS [C]
JOIN [dbo].[Account_Customer] AS [AC] ON [AC].[CustomerId] = [C].[CustomerId]
JOIN [dbo].[Account] AS [A] ON [A].[AccountId] = [AC].[AccountId]
WHERE 
(
    NOT EXISTS (SELECT TOP (1) 1  
            FROM [dbo].[Customer] AS [C2]
            JOIN [dbo].[Customer_Pet] AS [CP2] ON [CP2].[CustomerId] = [C2].[CustomerId]
            JOIN [dbo].[Pet] AS [P2] ON [P2].[PetId] = [CP2].[PetId]
            WHERE [C2].[CustomerId] = [C].[CustomerId] -- correlation
            AND [P2].[PetName] LIKE 'Cat%'
            )
    AND
    NOT EXISTS (SELECT TOP (1) 1  
            FROM [dbo].[Customer] AS [C2]
            JOIN [dbo].[Customer_Car] AS [CP2] ON [CP2].[CustomerId] = [C2].[CustomerId]
            JOIN [dbo].[Car] AS [P2] ON [P2].[CarId] = [CP2].[CarId]
            WHERE [C2].[CustomerId] = [C].[CustomerId] -- correlation
            AND [P2].[CarName] LIKE 'Dodge%'
            )
)

CodePudding user response：

I must say, in a real database I would be very suspicious of all these Many:Many relationships. Can an Account be owned by multiple Customers, each of whom can own multiple Accounts? Equally can a Cat or a Pet have multiple owners?

Be that as it may: you can express your query like this:

You want all Accounts...
for which there do not exist Account_Customers...
Where those Customers are in the set of Customers who own a Cat...
... or a Dodge

SELECT *
FROM Account a
WHERE NOT EXISTS (
    SELECT ac.CustomerId
    FROM Account_Customer ac
    WHERE ac.AccountId = a.AccountId
    INTERSECT
    (
    SELECT cp.CustomerId
    FROM Customer_Pet cp
    JOIN Pet p ON p.PetId = cp.PetId
    WHERE p.PetName LIKE 'Cat%'
    UNION ALL
    SELECT cc.CustomerId
    FROM Customer_Car cc
    JOIN Car c ON c.CarId = cc.CarId
    WHERE c.CarName LIKE 'Dodge%'
    )
)

db<>fiddle

CodePudding user response：

It's too late for a more in-depth answer, so here's a quick and dirty one with a temp table.

Mind you it's not as bad as it looks, many times I've had simple queries on temp tables massively outperform large, interesting (from a mathematic point of view) queries.

Also, a question about performance is never simple to answer. Of special interest is the fact that you mention millions of rows and need for performance while your query uses a like operator on some text column. At least the % is in the end, so it's still SARGable. Will this column have an index? That will probably make a difference.

Here (done blind, hopefully no errors):

create table #forbidden
(
    CustomerId int primary key 
)

insert #forbidden(CustomerId)
    select CustomerId from Customer C
    where 
           exists(select 1 from Customer_Pet CP where CP.CustomerId=C.CustomerId and CP.[PetName] LIKE 'Cat%')
        or exists(select 1 from Customer_Car CC where CC.CustomerId=C.CustomerId and CC.[CarName] LIKE 'Dodge%')
    
    
select * from Account A
where not exists
(
    select 1
    from Account_Customer AC
    where 
        AC.CustomerId=A.CustomerId
        and AC.CustomerId in (select CustomerId from #forbidden)
)