Home > Software design >  SQL window function group by
SQL window function group by

Time:12-02

I'm fairly new to window functions and have been learning about them recently. There is this query which returns the total sales for each month and quarter using a group by and an aggregate function. Additionally, it returns the total sales for the whole year (using a window function) and the max total sales for each quarter (using a window function). This all makes sense to me.

Query:

SELECT
    quarter(payment_date) quarter,
    monthname(payment_date) month_name,
    SUM(amount) monthly_sales,
    MAX(SUM(amount)) OVER() max_overall_values,
    MAX(SUM(amount)) OVER(PARTITION BY quarter(payment_date)) max_quarter_sales
FROM
    payment
WHERE
    year(payment_date) = 2005
GROUP BY
    quarter(payment_date),
    monthname(payment_date)
ORDER BY
    monthname(payment_date) DESC;

Results:

 --------- ------------ --------------- -------------------- ------------------- 
| quarter | month_name | monthly_sales | max_overall_values | max_quarter_sales |
 --------- ------------ --------------- -------------------- ------------------- 
|       2 | May        |       4824.43 |           28373.89 |           9631.88 |
|       2 | June       |       9631.88 |           28373.89 |           9631.88 |
|       3 | July       |      28373.89 |           28373.89 |          28373.89 |
|       3 | August     |      24072.13 |           28373.89 |          28373.89 |
 --------- ------------ --------------- -------------------- ------------------- 

I start to loose track of what's going on if I remove "max":

SELECT
    quarter(payment_date) quarter,
    monthname(payment_date) month_name,
    SUM(amount) monthly_sales,
    SUM(amount) OVER() max_overall_values,
    SUM(amount) OVER(PARTITION BY quarter(payment_date)) max_quarter_sales
FROM
    payment
WHERE
    year(payment_date) = 2005
GROUP BY
    quarter(payment_date),
    monthname(payment_date)
ORDER BY
    monthname(payment_date) DESC;

I get the following results:

 --------- ------------ --------------- -------------------- ------------------- 
| quarter | month_name | monthly_sales | max_overall_values | max_quarter_sales |
 --------- ------------ --------------- -------------------- ------------------- 
|       2 | May        |       4824.43 |              19.96 |              8.98 |
|       2 | June       |       9631.88 |              19.96 |              8.98 |
|       3 | July       |      28373.89 |              19.96 |             10.98 |
|       3 | August     |      24072.13 |              19.96 |             10.98 |
 --------- ------------ --------------- -------------------- ------------------- 

My question is what data is the "max" window function actually processing when it's used in this context, which involves a group by clause, and how is it arriving at the calculation of 19.96 for max_overall_values, 8.98 for a quarter and 10.98 for the other when "max" is removed?

CodePudding user response:

If you leave off the sum, the amount summed by the window function will be the amount that would have been returned if it were included in the select columns (e.g. SELECT amount, quarter(payment_date) quarter...).

Because different rows being grouped will have different amounts, this amount will be the amount column from an arbitrary row of those being grouped. This is almost never a useful thing, and indeed modern versions of mysql have a default ONLY_FULL_GROUP_BY sql_mode that returns an error if you attempt this in your query:

Expression #4 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'fiddle.payment.amount' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

fiddle

CodePudding user response:

Just sharing some findings.

Experiment

Prepare data as below. Two identical tables: payment1 and payment2 but rows were inserted in opposite order. (create table payment2 as select * from payment1 order by payment_date desc)

payment_date|amount|
------------ ------ 
  2020-05-01|     1|
  2020-05-02|     2|
  2020-05-03|     3|
  2020-06-01|     4|
  2020-06-02|     5|
  2020-06-03|     6|
  2020-07-01|     7|
  2020-07-02|     8|
  2020-07-03|     9|
  2020-08-01|    10|
  2020-08-02|    11|
  2020-08-03|    12|

Calculating monthly_sales for reference:

quarter|month_name|amount|monthly_sales|
------- ---------- ------ ------------- 
      2|May       |     1|            6|
      2|May       |     2|            6|
      2|May       |     3|            6|
      2|June      |     4|           15|
      2|June      |     5|           15|
      2|June      |     6|           15|
      3|July      |     7|           24|
      3|July      |     8|           24|
      3|July      |     9|           24|
      3|August    |    10|           33|
      3|August    |    11|           33|
      3|August    |    12|           33|
  1. apply max(sum()) and everything looks good.
quarter|month_name|monthly_sales|max_overall_amount|max_quarter_sales|
------- ---------- ------------- ------------------ ----------------- 
      2|May       |            6|                33|               15|
      2|June      |           15|                33|               15|
      3|July      |           24|                33|               33|
      3|August    |           33|                33|               33|
  1. apply sum() only. It seems that only first row of each month is picked (I tried use CTE to order by payment_date ASC or DESC but no difference)
quarter|month_name|monthly_sales|sum_overall_amount|sum_quarter_sales|
------- ---------- ------------- ------------------ ----------------- 
      2|May       |            6|                22|                5|
      2|June      |           15|                22|                5|
      3|July      |           24|                22|               17|
      3|August    |           33|                22|               17|

-- 22 = 1   4   7   10 --> first row of each month
--  5 = 1   4          --> first row of May, June
-- 17 = 7   10         --> first row of July, August

query against payment2:

quarter|month_name|monthly_sales|sum_overall_amount|sum_quarter_sales|
------- ---------- ------------- ------------------ ----------------- 
      2|May       |            6|                30|                9|
      2|June      |           15|                30|                9|
      3|July      |           24|                30|               21|
      3|August    |           33|                30|               21|

-- 30 = 3   6   9   12  --> first row of each month  in `payment2`
  1. apply max().
quarter|month_name|monthly_sales|max_overall_amount|max_quarter_sales|
------- ---------- ------------- ------------------ ----------------- 
      2|May       |            6|                10|                4|
      2|June      |           15|                10|                4|
      3|July      |           24|                10|               10|
      3|August    |           33|                10|               10|

--  4 = max(1,  4) --> first row of May , June
-- 10 = max(7, 10) --> first row of July, August

query against payment2:

quarter|month_name|monthly_sales|max_overall_amount|max_quarter_sales|
------- ---------- ------------- ------------------ ----------------- 
      2|May       |            6|                12|                6|
      2|June      |           15|                12|                6|
      3|July      |           24|                12|               12|
      3|August    |           33|                12|               12|

--  6 = max(3,  6) --> first row of May, Jun in `payment2`
-- 12 = max(9, 12) --> first row of July, August in `payment2`
  1. apply min():
quarter|month_name|monthly_sales|min_overall_amount|min_quarter_sales|
------- ---------- ------------- ------------------ ----------------- 
      2|May       |            6|                 1|                1|
      2|June      |           15|                 1|                1|
      3|July      |           24|                 1|                7|
      3|August    |           33|                 1|                7|

-- 1 = min(1,  4) --> first row of May, June
-- 7 = min(7, 10) --> first row of July, August

query against payment2:

quarter|month_name|monthly_sales|min_overall_amount|min_quarter_sales|
------- ---------- ------------- ------------------ ----------------- 
      2|May       |            6|                 3|                3|
      2|June      |           15|                 3|                3|
      3|July      |           24|                 3|                9|
      3|August    |           33|                 3|                9|

-- 3  = min(3,  6) --> first row of May, June in `payment2`
-- 9  = min(9, 12) --> first row of July, August in `payment2`

Environment:

  • MySQL 8.0.26 on Centos Stream 8
  • Disabled ONLY_FULL_GROUP_BY in session.

Observations:

  • GROUP BY quarter & month seems to make MySQL to pick the first row in physical location of each group (May, June, July, August) and then apply aggregate functions (sum(), max(), min())
  • data in payment1 and payment1 are identical but stored in opposite order and the same queries ran against the two tables returned different results.
  • data stored order in a table may change over time. Therefore, the outcome could be also changed.
  • If the above are true, disable ONLY_FULL_GROUP_BY sql_mode may be harmful due to inconsistent results.

A Question for @Robert: Could you please run similar experiments on your end to see what the outcome looks like? My environment is very limited. Thanks

  • Related