Home > Software engineering >  SQL WHERE function filter-unexpected answer
SQL WHERE function filter-unexpected answer

Time:12-07

The problem looks like this:

Write an SQL query that selects the product id, year, quantity, and price for the first year of every product sold.

Return the resulting table in any order.

A simple example of the question:

Input: 
Sales table:
 --------- ------------ ------ ---------- ------- 
| sale_id | product_id | year | quantity | price |
 --------- ------------ ------ ---------- -------  
| 1       | 100        | 2008 | 10       | 5000  |
| 2       | 100        | 2009 | 12       | 5000  |
| 7       | 200        | 2011 | 15       | 9000  |
 --------- ------------ ------ ---------- ------- 
Product table:
 ------------ -------------- 
| product_id | product_name |
 ------------ -------------- 
| 100        | Nokia        |
| 200        | Apple        |
| 300        | Samsung      |
 ------------ -------------- 
Output: 
 ------------ ------------ ---------- ------- 
| product_id | first_year | quantity | price |
 ------------ ------------ ---------- -------  
| 100        | 2008       | 10       | 5000  |
| 200        | 2011       | 15       | 9000  |
 ------------ ------------ ---------- ------- 

My code:

SELECT product_id, year AS first_year, quantity, price
FROM Sales
WHERE year IN (
SELECT MIN(year) as year
FROM Sales
GROUP BY product_id) ;

My code worked with the above simple example but failed on a longer test case.

The expected query:

SELECT product_id, year AS first_year, quantity, price
FROM Sales
WHERE (product_id, year) IN (
SELECT product_id, MIN(year) as year
FROM Sales
GROUP BY product_id) ;

So I don't understand why I have to put productid when filtering with where. Doesn't SQL automatically choose the corresponding product_id with the first year? Any hints would be greatly appreciated!

CodePudding user response:

Here's a step-by-step of why the first query doesn't work. Note that I've omitted some fields that are unused for the sake of brevity.

Imagine your sales data contained the following data:

product_id year
100 2008
100 2011
100 2011
100 2011
100 2011
200 2011

Based on this data, the inner subquery of your first query:

SELECT MIN(year) as year
FROM Sales
GROUP BY product_id

will produce a result as follows:

MIN(year)
2008
2011

And so then your query is effectively doing the following:

SELECT product_id, year AS first_year
FROM Sales
WHERE year IN (2008, 2011)

So this query is going to find all the sales that occurred in 2008 and all the sales that occurred in 2011. It is not going to filter by product_id as that is not specified in the WHERE statement. So it'll yield results as follows which is not what you want:

product_id first_year
100 2008
100 2011
100 2011
100 2011
100 2011
200 2011

This is why you need to specify the product_id in your IN statement.

On a general note, when debugging SQL, evaluate the inner-most queries first and then work outward as I have done in this answer.

  • Related