Home > Software engineering >  SUM and GROUP BY in SQL totaling issue
SUM and GROUP BY in SQL totaling issue

Time:11-29

I am working with two different covid datasets which include reported data from all countries. The issue I am having is with the sum and group by functions, where SQL is summing all grouped rows which unnecessarily increases the output. With the example below, the total deaths should be 4,849 with 17,441 total cases.

*Note: The datasets included the city/province for each country, and the below is already grouped by country

CREATE TABLE covid.deaths(
    country varchar(255),
    deaths int
);


CREATE TABLE covid.confirmed_cases(
    country varchar(255),
    cases int
);
    *country*   *deaths*    *cases*
    China   0   747
    China   0   519
    China   0   1500
    China   0   264
    China   1   159
    China   3   1992
    China   2   1008
    China   4   1632
    China   7   1650
    China   6   190
    China   213 1260
    China   8   1197
    China   58  3260
    China   13  362
    China   22  703
    China   4512 998

However when I run the query below I get 279,056 total cases and 77,584 deaths. In trying to self-solve this issue, I removed one of the sum functions (but kept the date) and found that with the output, SQL is populating the total down the column for all rows and then summing those columns when both sum functions are present (essentially the equation SQL is using is total deaths or cases x number of rows).

SELECT
    COALESCE(d.country_region, "Unknown") AS country,
    SUM(d._11_16_21) as deaths,
    SUM(c._11_16_21) as cases
FROM `covid.deaths` as d
JOIN `covid.confirmed_cases` as c
    ON d.country_region = c.country_region
WHERE d.country_region = "China"
GROUP BY 
    d.country_region

Output with the cases sum function removed

SELECT
    COALESCE(d.country_region, "Unknown") AS country,
    SUM(d._11_16_21) as deaths,
    c._11_16_21 as cases
FROM `covid.deaths` as d
JOIN `covid.confirmed_cases` as c
    ON d.country_region = c.country_region
WHERE d.country_region = "China"
GROUP BY 
    d.country_region, c._11_16_21  


    *country*   *deaths*    *cases*
    China   4849    747
    China   4849    519
    China   4849    1500
    China   4849    264
    China   4849    159
    China   4849    1992
    China   4849    1008
    China   4849    1632
    China   4849    1650
    China   4849    190
    China   4849    1260
    China   4849    1197
    China   4849    3260
    China   4849    362
    China   4849    703
    China   4849    998

Is there a way to have SQL only show the unique sum as the output?

*Note: Coalesce is being used as there are some country names which are null

CodePudding user response:

Sum them separately before joining covid cases to deaths

SELECT
    COALESCE(d.country_region, c.country_region, 'Unknown') AS country,
    d.deaths,
    c.cases
FROM (
    SELECT country_region
    , SUM(`_11_16_21`) as deaths
    FROM `covid.deaths` 
    WHERE country_region = 'China'
    GROUP BY country_region
) as d
JOIN (
    SELECT country_region
    , SUM(`_11_16_21`) as cases
    FROM  `covid.confirmed_cases`
    WHERE country_region = 'China'
    GROUP BY country_region
) as c
  ON c.country_region = d.country_region
  • Related