I'm trying to join a new column to my current query that uses aggregate functions. I create this column with a new query that also uses an aggregate function from a different table but I'm not sure if a JOIN
will work for me since I need to join it to its respective row.
TABLE A (employees that are enrolled or were enrolled in a project)
ID | DEPARTMENT | ENROLLED | PROJECT |
---|---|---|---|
1 | MARKETING | Yes | ARQ |
2 | MARKETING | Yes | TC |
3 | MARKETING | No | ARQ |
4 | MARKETING | No | TC |
5 | FINANCE | Yes | ARQ |
6 | FINANCE | Yes | TC |
7 | FINANCE | No | ARQ |
8 | FINANCE | Yes | TC |
This table has more departments and more projects, but I simplified.
TABLE B (relation with departments and employees)
ID | DEPARTMENT | TOTAL_EMPLOYEES |
---|---|---|
1 | MARKETING | 2 |
2 | MARKETING | 3 |
3 | FINANCE | 4 |
4 | FINANCE | 8 |
In my first query I was asked to achieve the following result - using only table A:
(employees enrolled) (employees not enrolled)
DEPARTMENT | ARQ_E | TC_E | TOTAL_ENROLLED | ARQ_N | TC_N | TOTAL_NOT_ENROLLED | TOTAL |
---|---|---|---|---|---|---|---|
MARKETING | 1 | 1 | 2 | 1 | 1 | 2 | 4 |
FINANCE | 1 | 1 | 2 | 1 | 1 | 2 | 4 |
Using the following query:
SELECT tableA.department,
sum(case when enrolled = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E,
sum(case when enrolled = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E,
sum(case when enrolled = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED,
sum(case when enrolled != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N,
sum(case when enrolled != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N,
sum(case when enrolled != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED,
count (*) AS Total
FROM tableA
GROUP BY tableA.department;
My second query gets departments and their total employees from table B:
DEPARTMENT | TOTAL_EMPLOYEES |
---|---|
MARKETING | 5 |
FINANCE | 12 |
Using the following query:
SELECT tableB.department,
sum(tableB.total_employees) AS TOTAL_EMPLOYEES
FROM tableB
GROUP BY tableB.department;
I need to add the column TOTAL_EMPLOYEES
to my first query, next to TOTAL
will be TOTAL_EMPLOYEES
. But it has to be placed with its respective department row. I need this to compare this 2 columns and see how many employees were not assigned to any project.
This is my expected result.
(employees enrolled) (employees not enrolled)
DEPARTMENT | ARQ_E | TC_E | TOTAL_ENROLLED | ARQ_N | TC_N | TOTAL_NOT_ENROLLED | TOTAL | T_EMPL |
---|---|---|---|---|---|---|---|---|
MARKETING | 1 | 1 | 2 | 1 | 1 | 2 | 4 | 5 |
FINANCE | 1 | 1 | 2 | 1 | 1 | 2 | 4 | 12 |
I have tried to achieve this using the following query:
SELECT tableA.department,
sum(case when enrolled = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E,
sum(case when enrolled = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E,
sum(case when enrolled = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED,
sum(case when enrolled != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N,
sum(case when enrolled != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N,
sum(case when enrolled != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED,
count (*) AS Total,
sum (tableB.total_employees) AS T_EMPL
FROM tableA
JOIN tableB
ON tableA.department = tableB.department
GROUP BY tableA.department;
But the numbers I get in my query are completely wrong since the JOINS repeat my rows and my SUMS duplicate.
I don't know if I really need to use a join or a subquery to place my sum(tableB.department)
in its respective row.
I'm using PostgreSQL but since I'm using Standard 92 any SQL solution will help.
CodePudding user response:
Your main issue stemmed from inadvertently multiplying rows with the join, and has already been addressed. See:
But use the standard SQL aggregate FILTER
clause. It's shorter, cleaner, and noticeably faster. See:
- Aggregate columns with additional (distinct) filters
- For absolute performance, is SUM faster or COUNT?
SELECT *
FROM (
SELECT department
, count(*) FILTER (WHERE enrolled AND project = 'ARQ') AS arq_e
, count(*) FILTER (WHERE enrolled AND project = 'TC') AS tc_e
, count(*) FILTER (WHERE enrolled) AS total_enrolled
, count(*) FILTER (WHERE NOT enrolled AND project = 'ARQ') AS arq_n
, count(*) FILTER (WHERE NOT enrolled AND project = 'TC') AS tc_n
, count(*) FILTER (WHERE NOT enrolled) AS total_not_enrolled
, count(*) AS total
FROM tableA a
GROUP BY 1
) a
LEFT JOIN ( -- !
SELECT department
, sum(total_employees) AS total_employees
FROM tableB b
GROUP BY 1
) b USING (department);
enrolled
should be a boolean
column. Make it so if it isn't. Then you can use it directly. Smaller, faster, cleaner, shorter code.
I replaced the [INNER] JOIN
with a LEFT [OUTER] JOIN
on a suspicion. Typically, you want to keep all results, even if the same department is not found in the other table. Maybe even a FULL [OUTER] JOIN
?
Also, USING (department)
as join condition conveniently outputs that column only once, so we can make do with SELECT *
in the outer SELECT
.
Finally, subqueries are shorter and faster than CTEs. Not much since Postgres 12, but still. See:
CodePudding user response:
Join the results of the two queries, using sub-queries, don't join the tables.
That way you're joining 1 row of enrollment data per department to 1 row of employee data per department.
SELECT
*
FROM
(
SELECT tableA.department,
sum(case when enrolled = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E,
sum(case when enrolled = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E,
sum(case when enrolled = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED,
sum(case when enrolled != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N,
sum(case when enrolled != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N,
sum(case when enrolled != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED,
count (*) AS Total
FROM tableA
GROUP BY tableA.department
)
AS enroll
INNER JOIN
(
SELECT tableB.department,
sum(tableB.total_employees) AS Total_EMPLOYEES
FROM tableB
GROUP BY tableB.department
)
AS employee
ON employee.department = enroll.department
CodePudding user response:
As the join will multiply the summ, you can first sum the values and then join them
WITH CTE1 as (SELECT tableA.department,
sum(case when enrolled = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E,
sum(case when enrolled = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E,
sum(case when enrolled = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED,
sum(case when enrolled != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N,
sum(case when enrolled != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N,
sum(case when enrolled != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED,
count (*) AS Total
FROM tableA
GROUP BY tableA.department),
CTE2 as (SELECT tableB.department,
sum(tableB.total_employees) AS TOTAL_EMPLOYEES
FROM tableB
GROUP BY tableB.department)
SELECT
CTE1.department, ARQ_E, TC_E, TOTAL_ENROLLED, ARQ_N, TC_N, TOTAL_NOT_ENROLLED, TOTAL, T_EMPL,CTE2.TOTAL_EMPLOYEES
FROM CTE1 JOIN CTE2 ON CTE1.department = CTE2.department