I am working with two different covid datasets which include reported data from all countries. The issue I am having is with the sum
and group by
functions, where SQL is summing all grouped rows which unnecessarily increases the output. With the example below, the total deaths should be 4,849 with 17,441 total cases.
*Note: The datasets included the city/province for each country, and the below is already grouped by country
CREATE TABLE covid.deaths(
country varchar(255),
deaths int
);
CREATE TABLE covid.confirmed_cases(
country varchar(255),
cases int
);
*country* *deaths* *cases*
China 0 747
China 0 519
China 0 1500
China 0 264
China 1 159
China 3 1992
China 2 1008
China 4 1632
China 7 1650
China 6 190
China 213 1260
China 8 1197
China 58 3260
China 13 362
China 22 703
China 4512 998
However when I run the query below I get 279,056 total cases and 77,584 deaths. In trying to self-solve this issue, I removed one of the sum
functions (but kept the date) and found that with the output, SQL is populating the total down the column for all rows and then summing those columns when both sum
functions are present (essentially the equation SQL is using is total deaths or cases x number of rows
).
SELECT
COALESCE(d.country_region, "Unknown") AS country,
SUM(d._11_16_21) as deaths,
SUM(c._11_16_21) as cases
FROM `covid.deaths` as d
JOIN `covid.confirmed_cases` as c
ON d.country_region = c.country_region
WHERE d.country_region = "China"
GROUP BY
d.country_region
Output with the cases sum
function removed
SELECT
COALESCE(d.country_region, "Unknown") AS country,
SUM(d._11_16_21) as deaths,
c._11_16_21 as cases
FROM `covid.deaths` as d
JOIN `covid.confirmed_cases` as c
ON d.country_region = c.country_region
WHERE d.country_region = "China"
GROUP BY
d.country_region, c._11_16_21
*country* *deaths* *cases*
China 4849 747
China 4849 519
China 4849 1500
China 4849 264
China 4849 159
China 4849 1992
China 4849 1008
China 4849 1632
China 4849 1650
China 4849 190
China 4849 1260
China 4849 1197
China 4849 3260
China 4849 362
China 4849 703
China 4849 998
Is there a way to have SQL only show the unique sum as the output?
*Note: Coalesce is being used as there are some country names which are null
CodePudding user response:
Sum them separately before joining covid cases to deaths
SELECT
COALESCE(d.country_region, c.country_region, 'Unknown') AS country,
d.deaths,
c.cases
FROM (
SELECT country_region
, SUM(`_11_16_21`) as deaths
FROM `covid.deaths`
WHERE country_region = 'China'
GROUP BY country_region
) as d
JOIN (
SELECT country_region
, SUM(`_11_16_21`) as cases
FROM `covid.confirmed_cases`
WHERE country_region = 'China'
GROUP BY country_region
) as c
ON c.country_region = d.country_region