group by needed for window function?-CodePudding

I have this schema for employee table:

id  int
first_name varchar
last_name  varchar
department_id  int
department_name  varchar
position  varchar

I want to rank departments by size. This works:

select
department_id d_id,
rank() over (order by count(*) desc) r
from employees
group by department_id

What I don't understand is why group by is required. If I remove it I get this error:

column "employee.department_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2: department_id d_id,

CodePudding user response：

This query:

select department_id d_id, 
       count(*) number_of_employees
from az_employees
group by department_id

returns 1 row for each department with the number of employees in the department.

Your query uses RANK() window function to rank the departments based on the results of the aggregate function count(*):

rank() over (order by count(*) desc) r

RANK() operates on the results of the aggregate query (1 row for each department with 2 columns: department_id and count(*)) and returns 1 more column for each department.

It would be the same as if you used the aggregate query as a subquery:

select d_id, rank() over (order by number_of_employees desc) r
from (
  select department_id d_id, 
         count(*) number_of_employees
  from az_employees
  group by department_id 
) t

but your query is simpler.

CodePudding user response：

What I don't understand is why group by is required.

Because you are using aggregate count in window function.

We can't use aggregate function in window function, If I understand correctly you can try to use a subquery first

SELECT *,rank() over (order by cnt desc) r
FROM (
    SELECT *,COUNT(*) OVER(PARTITION BY department_id)  cnt
    FROM az_employees
) t1

CodePudding user response：

This is sort of an answer to another question but might be what you really want.

select
department_id d_id,
count(*) count // this line can be removed
from az_employees
group by department_id
order by count(*)
having count(*) > 0 // this line can be removed, but if you for instance change zero to one one departments with more than one employee will be shown.

as previous answer explained, group by is used to aggregate. So no reason to go into details

While I don't know this scenario it's generally a bad habit to remove original data unless there is some gain from it, and as rank is usually only used to sort by later on and that can be done just as easily with the original value. But there is of-course situations when rank is a good solution, but I don't think this is one of those.