Is there a better way to formulate this SQL query?-CodePudding

I am very new to the SQL universe, and I came across this prompt that I was able to fulfill, but I have to imagine I'm missing a more direct and intuitive solution. My solution returns the correct response in SQLite within rounding error to over 10 decimal places but technically does not match the reported solution. I appreciate any insight.

Prompt: Find the difference between the average rating ["stars"] of movies released before 1980 and the average rating of movies released after 1980. (The difference between the average of averages before and after.)

The database includes 3 tables with the following columns (simplified for relevance):

movie| mID*, year
reviewer| rID*, name
rating| rID*, mI*, stars
"mavg" is my own aliased aggregation

select distinct(
        (select avg(mavg)
         from(
               (select *, avg(stars) as mavg
                from rating
                group by mID) join movie using(mID) )
                where year < 1980)    -
        (select avg(mavg)
         from(
               (select *, avg(stars) as mavg
                from rating
                group by mID) join movie using(mID) )
                where year >= 1980)
              )
from rating
;

CodePudding user response：

You may use the following single query here:

SELECT AVG(CASE WHEN m.year < 1980 THEN r.stars END) -
       AVG(CASE WHEN m.year >= 1980 THEN r.stars END) AS mavg
FROM rating r
INNER JOIN movie m ON m.mID = r.mID;

CodePudding user response：

Let's look at your subquery:

select *, avg(stars) as mavg
from rating
group by mID

This is an invalid query. With GROUP BY mid you say you want to aggregate your rows to get one result row per mID. But then you don't only select the average rating, but all columns from the table (SELECT *). One of these columns is stars. How can you select the stars column into one row, when there are many rows for an mID? Most DBMS report a syntax error here. SQLite picks one of the stars from any of the mID's rows arbitrarily instead. So, while this is considered valid in SQLite, it isn't in standard SQL, and you should not write such queries.

To the result (the average per movie) you join the movies table. And then you select the average of the movie ratings for the movies in the desired years. This is well done, but you could have put that restriction (join or IN clause or EXISTS clause) right into the subquery in order to only calculate the averages for the movies you want, rather then calculating all averages and then only keep some of the movies and dismiss others. But that's a minor detail.

Then you subtract the new average from the old one. This means you subtract one value from another and end up with exactly the one value you want to show. But instead of merely selecting this value (SELECT (...) - (...)) you are linking the value with the rating table (SELECT (...) - (...) FROM rating) for no apparent reason, thus selecting the desired value as often as there are rows in the rating table. You then notice this and apply DISTINCT to get rid of the rows you just created unnecessarily yourself. DISTINCT is very, very often an indicator for a badly written query. When you think you need DISTINCT, ask yourself what makes this necessary. Where do the duplicate rows come from? Have you created them yourself? The amend this.

The query can be written thus:

select
  avg(case when m.year < 1980 then r.movie_rating end) -
  avg(case when m.year >= 1980 then r.movie_rating end) as diff
from
(
  select mid, avg(stars) as movie_rating
  from rating
  group by mid
) r
join movie m using (mid);

Using a case expression inside an aggregation function is called conditional aggregation and is often the preferred solution when working with diferent aggregates.