Home > database >  How is this SQL sub-query correctly vectorising?
How is this SQL sub-query correctly vectorising?

Time:11-07

Sample data:

CREATE TABLE Departments (
  Code INTEGER PRIMARY KEY,
  Name varchar(255) NOT NULL ,
  Budget decimal NOT NULL 
);

CREATE TABLE Employees (
  SSN INTEGER PRIMARY KEY,
  Name varchar(255) NOT NULL ,
  LastName varchar(255) NOT NULL ,
  Department INTEGER NOT NULL , 
  foreign key (department) references Departments(Code) 
)

INSERT INTO Departments(Code,Name,Budget) VALUES(14,'IT',65000);
INSERT INTO Departments(Code,Name,Budget) VALUES(37,'Accounting',15000);
INSERT INTO Departments(Code,Name,Budget) VALUES(59,'Human Resources',240000);
INSERT INTO Departments(Code,Name,Budget) VALUES(77,'Research',55000);

INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('123234877','Michael','Rogers',14);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('152934485','Anand','Manikutty',14);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('222364883','Carol','Smith',37);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('326587417','Joe','Stevens',37);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('332154719','Mary-Anne','Foster',14);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('332569843','George','O''Donnell',77);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('546523478','John','Doe',59);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('631231482','David','Smith',77);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('654873219','Zacary','Efron',59);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('745685214','Eric','Goldsmith',59);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('845657245','Elizabeth','Doe',14);
INSERT INTO Employees(SSN,Name,LastName,Department) VALUES('845657246','Kumar','Swamy',14);

Problem: "Select the names of departments with more than two employees."

Wikibooks solution:

/*With subquery*/
SELECT D.Name FROM Departments D
  WHERE 2 < 
  (
   SELECT COUNT(*) 
     FROM Employees
     WHERE Department = D.Code
  );

My question: How does this solution work? That is, how does MSSQL know which values in Departments are to be kept from the sub-query? I can't see any way that the condition WHERE Department = D.Code can return a result that is ordered in a useful way to the outer query. I don't think that this is a fluke, I think that I just don't understand how SQL is vectorised.

CodePudding user response:

This is called a correlated subquery.

That is to say, the inner query is correlated to the outer one by use of an outer reference. In this case, that is D.Code. Therefore the subquery is being calculated for every row of D.

It's not a matter of ordering, in fact this query can return results in any order. But the result from the subquery must be greater than 2 otherwise the WHERE predicate fails.

SELECT D.Name FROM Departments D  -- Departments has been aliased as D
  WHERE 2 < 
  (
   SELECT COUNT(*) 
     FROM Employees
     WHERE Department = D.Code    -- Here the inner query is being limited by
                                  -- the reference to the outer D table
  );

I would probably use ... > 2 rather than 2 < ...

Side point: It's better to always use an explicit table reference in subqueries, eg e.Department = D.Code, because otherwise you could misspell a column and end up referring to an outer column instead of an inner column, and the correlation wouldn't work properly

  • Related