Why do I need to create a subquery instead of just comparing in this case?-CodePudding

I have written this query

SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME
FROM employees
WHERE SALARY > (SELECT AVG(SALARY)
                FROM employees)

I'm a bit confused why I have to create a subquery for it and why can I not just write the query like this:

SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME
FROM employees
WHERE SALARY > AVG(SALARY)

CodePudding user response：

SQL is a set based language. If you aggregate just one column of a set, you must aggregate the entire set. You can't both aggregate a set and not aggregate a set. Furthermore, a SELECT statement defines only a single set. Therefore you can't compare a value from an aggregated set to a value from a non-aggregated set (two sets) in the same SELECT statement.

Your first statement works because you have two SELECT statements that each define a distinct set. One set gets aggregated via the avg() function, and the other set remains non-aggregated.

It also works because the aggregated set is scalar (it has a single value). It can compare each row in your non-aggregated set to the scalar value held in your single-value aggregated set. Had that aggregated set been defined in such a way that it wasn't scalar, then an error would have been thrown and you would have had to establish a relationship between the two sets via an ON clause in a JOIN inside your FROM clause or through a correlated subquery.

The other reason is order of operations. Filtering via a WHERE clause happens first in the execution of SQL. At the point that your data is being filtered, no other operation has happened yet. GROUP BY/Aggregation occurs very late in the execution of SQL. So you are attempting to compare the results of two very different steps in SQL to each other.

Order of operations is the reason why HAVING exists. It's very similar to WHERE but acts AFTER aggregation. That won't help you with what you are trying to do in this SQL though, because, again, you are trying to compare a value from a non-aggregated set to a value from an aggregated set which simply can't be done in a single SELECT statement.

It's probably worth noting that you CAN aggregate (so-to-speak) inside of a non-aggregated set using "window functions" (also called "ordered analytical functions" or "analytical functions") which are supported in most RDBMSs.

For instance:

SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, AVG(SALARY) OVER () as avg_salary
FROM employees;

This will still spit out a non-aggregated row for every row in table employees. It will have a 4th column which contains, for every row, the average salary for ALL employees. Every row will contain that same value.

 ------------- ------------ ------------ -------- ------------ 
| EMPLOYEE_ID | FIRST_NAME | LAST_NAME  | SALARY | avg_salary |
 ------------- ------------ ------------ -------- ------------ 
|           1 | bob        | mcbob      |    100 |       210  |
|           2 | sue        | o'susan    |    230 |       210  |
|           3 | venkat     | van venkat |    300 |       210  |
 ------------- ------------ ------------ -------- ------------

That being said, you can't compare the results of a window function inside of a WHERE or a HAVING clause because of order of operations. Window function logic runs after nearly every other step in sql execution. You will simply get an error that it isn't allowed. So you will need, again, two SELECT statements:

SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME
FROM 
    (
         SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, AVG(SALARY) OVER () as avg_salary
         FROM employees;
    ) dt
WHERE SALARY > avg_salary;

Finally, there are two RDBMS's on the market that have a QUALIFY clause (Snowflake and Teradata) that are like a WHERE or HAVING clause that DO allow a window function to be part of a filter. If you are using one of those two platforms then you CAN write this as a single SELECT statement:

SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME
FROM employees
QUALIFY SALARY > AVG(SALARY) OVER ();

Just like WHERE acts at the beginninig of execution, and HAVING acts closer to the end of execution, QUALIFY acts even later than that (just before ORDER BY). Oddly, this is exactly what you were wanting in the first place and some RDBMS's anticipated your need. My hope is that more RDBMSs pick up that QUALIFY clause in future versions as it's VERY handy.