SQL - Same table - Having c2.Country <> c.Country - need to understand-CodePudding

I tried the following SQL query and did not understand how last part in the parenthesis

SELECT COUNT(c2.CustomerID) 
FROM Customers AS c2 
GROUP BY c2.Country 
HAVING c2.Country <> c.Country

works.

SELECT COUNT(c.CustomerID) AS cnt, c.Country
FROM Customers c
GROUP BY c.Country
HAVING cnt NOT IN (SELECT COUNT(c2.CustomerID) 
                   FROM Customers AS c2 
                   GROUP BY c2.Country 
                   HAVING c2.Country <> c.Country);

The first three lines give me country name and count of their customer id.

For example:

Argentina 3
Belgium   3
Brazil    9
IRAN      2
Turkey    2

The last part

SELECT COUNT(c2.CustomerID) 
FROM Customers AS c2 
GROUP BY c2.Country 
HAVING c2.Country <> c.Country

only returns count customer id of country for which there is other countries with the same number of count. How does this work?

For example:

3
2

and don't give 9 for the brazil.

I don't really understand the second condition of the HAVING query works: Having c2.Country <> c.Country. I know what c2.Country <> c.Country means. I don't understand how the bold line works.

When The query in the parenthesis is linked to the original query with a "HAVING NOT IN" which essentially compares the values from the first part of the query and the values from the second (Argentina 3, Belgium 3, IRAN 2, Turkey 2) and will return all rows that are not present in the second query, which is Brazil 9.

CodePudding user response：

Fiddle

Given:

CREATE TABLE Customers (
     CustomerID int PRIMARY KEY AUTO_INCREMENT
   , Country VARCHAR(20)
);

INSERT INTO Customers (Country) VALUES
    ('Argentina')
  , ('Argentina')
  , ('Argentina')
  , ('Belgium')
  , ('Belgium')
  , ('Belgium')
  , ('Brazil')
  , ('Brazil')
  , ('Brazil')
  , ('Brazil')
  , ('Brazil')
  , ('Brazil')
  , ('Brazil')
  , ('Brazil')
  , ('Brazil')
  , ('IRAN')
  , ('IRAN')
  , ('Turkey')
  , ('Turkey')
;

and the SQL:

SELECT COUNT(c.CustomerID) as cnt, c.Country
  FROM Customers c
 GROUP BY c.Country
Having cnt NOT IN (SELECT COUNT(c2.CustomerID) FROM Customers AS c2 GROUP BY c2.Country Having c2.Country <> c.Country)
;

The above SQL has dependent / correlated behavior, where the subquery has a reference to the outer row (c.Country).

Functionally, that means we execute the subquery for each row generated by the outer GROUP BY clause.

This also means we have to be careful to only reference the outer row with expressions that are functionally dependent on the outer GROUP BY terms.

First, let's look at the entire result of the outer GROUP BY logic (without the HAVING logic):

Country	cnt
Argentina	3
Belgium	3
Brazil	9
IRAN	2
Turkey	2

Let's take just one outer row as an example:

Country = 'IRAN', cnt = 2

When we execute the subquery for this row, we effectively do this:

(SELECT COUNT(c2.CustomerID) FROM Customers AS c2 GROUP BY c2.Country Having c2.Country <> 'IRAN')

This produces a list of (only selecting the cnt):

Country	cnt
Argentina	3
Belgium	3
Brazil	9
Turkey	2

But without 'IRAN' due to the HAVING clause: Having c2.Country <> c.Country

The list of cnt produced is: 3, 3, 9, 2

Now, since the outer row is associated with cnt = 2, the NOT IN clause eliminates that outer row from the final result.

The same thing happens to Turkey, Argentina and Belgium, as we process those outer rows.

The only remaining outer row is: Brazil with cnt = 9

Let me know if you need more information.

CodePudding user response：

The <> is a not equals operator which will return any countries with a customer ID count that is not equal to another country (in this instance Brazil 9). This query in the parenthesis is then linked to the original query with a "HAVING NOT IN" which essentially compares the values from the first part of the query and the values from your second (Brazil 9) and will return all rows that are not present in the second query - the one in your parenthesis. If there was another country with a customer ID count of 9 then Brazil would appear in your final results.

I do agree with @jarlh though that this query is rather odd and likely not fully optimal.

CodePudding user response：

First things first, your query should return the count first then the name of the country. SELECT COUNT(c.CustomerID) as cnt, c.Country not the vice versa. Secondly, <> returns the rows where two columns are not equal to each other, which in this case is the country in the outer query does not equal to the country in the inner query.