INNER JOIN with aggregate functions in my SELECT-CodePudding

I'm trying to join a new column to my current query that uses aggregate functions. I create this column with a new query that also uses an aggregate function from a different table but I'm not sure if a JOIN will work for me since I need to join it to its respective row.

TABLE A (employees that are enrolled or were enrolled in a project)

ID	DEPARTMENT	ENROLLED	PROJECT
1	MARKETING	Yes	ARQ
2	MARKETING	Yes	TC
3	MARKETING	No	ARQ
4	MARKETING	No	TC
5	FINANCE	Yes	ARQ
6	FINANCE	Yes	TC
7	FINANCE	No	ARQ
8	FINANCE	Yes	TC

This table has more departments and more projects, but I simplified.

TABLE B (relation with departments and employees)

ID	DEPARTMENT	TOTAL_EMPLOYEES
1	MARKETING	2
2	MARKETING	3
3	FINANCE	4
4	FINANCE	8

In my first query I was asked to achieve the following result - using only table A:

             (employees enrolled)               (employees not enrolled)

DEPARTMENT	ARQ_E	TC_E	TOTAL_ENROLLED	ARQ_N	TC_N	TOTAL_NOT_ENROLLED	TOTAL
MARKETING	1	1	2	1	1	2	4
FINANCE	1	1	2	1	1	2	4

Using the following query:

SELECT  tableA.department, 
        sum(case when enrolled = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E,
        sum(case when enrolled = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E,
        sum(case when enrolled = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED,
        sum(case when enrolled != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N,
        sum(case when enrolled != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N,
        sum(case when enrolled != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED,
        count (*) AS Total               
FROM    tableA
GROUP BY tableA.department;

My second query gets departments and their total employees from table B:

DEPARTMENT	TOTAL_EMPLOYEES
MARKETING	5
FINANCE	12

Using the following query:

SELECT  tableB.department,
        sum(tableB.total_employees) AS TOTAL_EMPLOYEES
FROM tableB
GROUP BY tableB.department;

I need to add the column TOTAL_EMPLOYEES to my first query, next to TOTAL will be TOTAL_EMPLOYEES. But it has to be placed with its respective department row. I need this to compare this 2 columns and see how many employees were not assigned to any project.

This is my expected result.

                  (employees enrolled)               (employees not enrolled)

DEPARTMENT	ARQ_E	TC_E	TOTAL_ENROLLED	ARQ_N	TC_N	TOTAL_NOT_ENROLLED	TOTAL	T_EMPL
MARKETING	1	1	2	1	1	2	4	5
FINANCE	1	1	2	1	1	2	4	12

I have tried to achieve this using the following query:

SELECT  tableA.department, 
        sum(case when enrolled  = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E,
        sum(case when enrolled  = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E,
        sum(case when enrolled  = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED,
        sum(case when enrolled  != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N,
        sum(case when enrolled  != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N,
        sum(case when enrolled  != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED,
        count (*) AS Total,
        sum (tableB.total_employees) AS T_EMPL               
FROM    tableA
JOIN    tableB
ON tableA.department = tableB.department
GROUP BY tableA.department;

But the numbers I get in my query are completely wrong since the JOINS repeat my rows and my SUMS duplicate.

I don't know if I really need to use a join or a subquery to place my sum(tableB.department) in its respective row.

I'm using PostgreSQL but since I'm using Standard 92 any SQL solution will help.

CodePudding user response：

Your main issue stemmed from inadvertently multiplying rows with the join, and has already been addressed. See:

Two SQL LEFT JOINS produce incorrect result

But use the standard SQL aggregate FILTER clause. It's shorter, cleaner, and noticeably faster. See:

SELECT *
FROM  (
   SELECT department
        , count(*) FILTER (WHERE enrolled AND project = 'ARQ')     AS arq_e
        , count(*) FILTER (WHERE enrolled AND project = 'TC')      AS tc_e
        , count(*) FILTER (WHERE enrolled)                         AS total_enrolled
        , count(*) FILTER (WHERE NOT enrolled AND project = 'ARQ') AS arq_n
        , count(*) FILTER (WHERE NOT enrolled AND project = 'TC')  AS tc_n
        , count(*) FILTER (WHERE NOT enrolled)                     AS total_not_enrolled
        , count(*) AS total
   FROM   tableA a
   GROUP  BY 1
   ) a
LEFT JOIN  (  -- !
   SELECT department
        , sum(total_employees) AS total_employees
   FROM   tableB b
   GROUP  BY 1
   ) b USING (department);

enrolled should be a boolean column. Make it so if it isn't. Then you can use it directly. Smaller, faster, cleaner, shorter code.

I replaced the [INNER] JOIN with a LEFT [OUTER] JOIN on a suspicion. Typically, you want to keep all results, even if the same department is not found in the other table. Maybe even a FULL [OUTER] JOIN?

Also, USING (department) as join condition conveniently outputs that column only once, so we can make do with SELECT * in the outer SELECT.

Finally, subqueries are shorter and faster than CTEs. Not much since Postgres 12, but still. See:

Does WITH query store the results of referred tables?

CodePudding user response：

Join the results of the two queries, using sub-queries, don't join the tables.

That way you're joining 1 row of enrollment data per department to 1 row of employee data per department.

SELECT
  *
FROM
(
  SELECT  tableA.department, 
          sum(case when enrolled  = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E,
          sum(case when enrolled  = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E,
          sum(case when enrolled  = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED,
          sum(case when enrolled  != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N,
          sum(case when enrolled  != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N,
          sum(case when enrolled  != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED,
          count (*) AS Total
               
  FROM    tableA
  GROUP BY tableA.department
)
  AS enroll
INNER JOIN
(
  SELECT  tableB.department,
          sum(tableB.total_employees) AS Total_EMPLOYEES
  FROM tableB
  GROUP BY tableB.department
)
  AS employee
    ON employee.department = enroll.department

CodePudding user response：

As the join will multiply the summ, you can first sum the values and then join them

WITH CTE1 as (SELECT  tableA.department, 
        sum(case when enrolled  = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E,
        sum(case when enrolled  = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E,
        sum(case when enrolled  = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED,
        sum(case when enrolled  != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N,
        sum(case when enrolled  != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N,
        sum(case when enrolled  != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED,
        count (*) AS Total
               
FROM    tableA
GROUP BY tableA.department),
CTE2 as (SELECT  tableB.department,
        sum(tableB.total_employees) AS TOTAL_EMPLOYEES

FROM tableB
GROUP BY tableB.department)
SELECT
CTE1.department,    ARQ_E,  TC_E,   TOTAL_ENROLLED,     ARQ_N,  TC_N,   TOTAL_NOT_ENROLLED,     TOTAL,  T_EMPL,CTE2.TOTAL_EMPLOYEES
FROM CTE1 JOIN CTE2 ON CTE1.department = CTE2.department