Using CASE and GROUP BY operator-CodePudding

I have a score table like this -

student_id	course_id	score
01	01	80
01	02	90
01	03	99
02	02	70
02	03	60

I want to generate a table showing a student's scores in one row, and if there is no score, show it as NULL. Like this -

student_id	01	02	03
01	80	90	99
02	NULL	70	60

The code that I came up with is -

SELECT 
    student_id, 
    (CASE WHEN course_id = '01' THEN score ELSE NULL END) AS '01', 
    (CASE WHEN course_id = '02' THEN score ELSE NULL END) AS '02', 
    (CASE WHEN course_id = '03' THEN score ELSE NULL END) AS '03'
FROM table;

And as expected I got three scores in separate rows. Like this -

student_id	01	02	03
01	80	NULL	NULL
01	NULL	90	NULL
01	NULL	NULL	99
02	NULL	70	NULL
02	NULL	NULL	60

The answer says I need to use MAX in front of the CASE statement. Also GROUP BY student_id at the end, or the student with NULL score value will not be shown in the result. Like this -

SELECT 
    student_id, 
    MAX(CASE WHEN course_id = '01' THEN score ELSE NULL END) AS '01', 
    MAX(CASE WHEN course_id = '02' THEN score ELSE NULL END) AS '02', 
    MAX(CASE WHEN course_id = '03' THEN score ELSE NULL END) AS '03',
FROM table
GROUP BY student_id;

My questions are:

How does MAX combine three rows each with one score, into one row with three scores? What happened here?
If I do not use GROUP BY student_id at the end, the result will exclude the student with NULL score value. Why? How does GROUP BY include the student back?

Any help would be appreciated! :)

CodePudding user response：

GROUP By will produce a result set with one row for every student_id.

The MAX will garantue that the higest Value will be in the row for everys student_id as you have one score and multiple NULLs.

To get all Studnent_ids you need to LEFT JOIN all students, so ou have all.

SELECT
 t1.student_id,
COALESCE(`01`,0) as '01',
COALESCE(`02`,0) as '02',
COALESCE(`03`,0) as '03'
FROM
(SELECT DISTINCT student_id   FROM score) t1 
LEFT JOIN (SELECT 
    student_id, 
    MAX(CASE WHEN course_id = '01' THEN score ELSE NULL END) AS '01', 
    MAX(CASE WHEN course_id = '02' THEN score ELSE NULL END) AS '02', 
    MAX(CASE WHEN course_id = '03' THEN score ELSE NULL END) AS '03',
FROM table
GROUP BY student_id) t2 ON t1.student_id = t2.student_id

CodePudding user response：

With your intermediate results, you need to return one row per student_id which is where the group by comes in, it collects all rows together for each unique value of the grouped columns.

On its own that wouldn't get you anywhere, you also need to tell SQL what to do with the other non-grouped columns, which is where max() comes in.

This returns the maximum value per group (i.e. per each unique Student_Id) and automatically excludes NULL values; since there is only one value other than NULL in each column, the required value is returned. Note you can equally use other aggregate functions eg Min() - the end result is the same.

Also note you can drop the else null part as NULL is the default when not matched.

CodePudding user response：

This kind of operation is called "Pivoting", and it uses two steps:

"enlarging the schema", by adding more columns to fit the maximum number of potential values that a column can have (in your specific case, the column is "score"). This operation just affects the amount of columns, while keeping the amount of rows same as the original table. Output for the single id
"aggregating the rows", necessary step as long as we have already distributed the values of your column in different numbered columns. This operation just affects the amount of rows, while keeping the amount of columns same as the output of the previous step. Output for the single id

So how do you do each of these two steps?

In order to "enlarge the schema", you require CASE statements (how many? the amount equals the maximum amount of values you want to extract for each id), which will extract only one value per row, for each id. Given that each new CASE column will have one non-NULL and others NULL-only values (for each id), then the MAX aggregation function comes in handy, and it will select the maximum value between one non-null value and all other NULL values (which results in selecting the non-null value). In order to limit the range of action of the MAX aggregation function to each specific id, we use the GROUP BY, which tells the DBMS to select the maximum value for each field(s) specified in its definition (for this task, only "student_id").

Hopefully this will answer both your doubts regarding how this way of solving pivoting problems work.