I'm fairly new to window functions and have been learning about them recently. There is this query which returns the total sales for each month and quarter using a group by and an aggregate function. Additionally, it returns the total sales for the whole year (using a window function) and the max total sales for each quarter (using a window function). This all makes sense to me.
Query:
SELECT
quarter(payment_date) quarter,
monthname(payment_date) month_name,
SUM(amount) monthly_sales,
MAX(SUM(amount)) OVER() max_overall_values,
MAX(SUM(amount)) OVER(PARTITION BY quarter(payment_date)) max_quarter_sales
FROM
payment
WHERE
year(payment_date) = 2005
GROUP BY
quarter(payment_date),
monthname(payment_date)
ORDER BY
monthname(payment_date) DESC;
Results:
--------- ------------ --------------- -------------------- -------------------
| quarter | month_name | monthly_sales | max_overall_values | max_quarter_sales |
--------- ------------ --------------- -------------------- -------------------
| 2 | May | 4824.43 | 28373.89 | 9631.88 |
| 2 | June | 9631.88 | 28373.89 | 9631.88 |
| 3 | July | 28373.89 | 28373.89 | 28373.89 |
| 3 | August | 24072.13 | 28373.89 | 28373.89 |
--------- ------------ --------------- -------------------- -------------------
I start to loose track of what's going on if I remove "max":
SELECT
quarter(payment_date) quarter,
monthname(payment_date) month_name,
SUM(amount) monthly_sales,
SUM(amount) OVER() max_overall_values,
SUM(amount) OVER(PARTITION BY quarter(payment_date)) max_quarter_sales
FROM
payment
WHERE
year(payment_date) = 2005
GROUP BY
quarter(payment_date),
monthname(payment_date)
ORDER BY
monthname(payment_date) DESC;
I get the following results:
--------- ------------ --------------- -------------------- -------------------
| quarter | month_name | monthly_sales | max_overall_values | max_quarter_sales |
--------- ------------ --------------- -------------------- -------------------
| 2 | May | 4824.43 | 19.96 | 8.98 |
| 2 | June | 9631.88 | 19.96 | 8.98 |
| 3 | July | 28373.89 | 19.96 | 10.98 |
| 3 | August | 24072.13 | 19.96 | 10.98 |
--------- ------------ --------------- -------------------- -------------------
My question is what data is the "max" window function actually processing when it's used in this context, which involves a group by clause, and how is it arriving at the calculation of 19.96 for max_overall_values, 8.98 for a quarter and 10.98 for the other when "max" is removed?
CodePudding user response:
If you leave off the sum, the amount summed by the window function will be the amount that would have been returned if it were included in the select columns (e.g. SELECT amount, quarter(payment_date) quarter...).
Because different rows being grouped will have different amounts, this amount will be the amount column from an arbitrary row of those being grouped. This is almost never a useful thing, and indeed modern versions of mysql have a default ONLY_FULL_GROUP_BY sql_mode that returns an error if you attempt this in your query:
Expression #4 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'fiddle.payment.amount' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
CodePudding user response:
Just sharing some findings.
Experiment
Prepare data as below. Two identical tables: payment1
and payment2
but rows were inserted in opposite order. (create table payment2 as select * from payment1 order by payment_date desc
)
payment_date|amount|
------------ ------
2020-05-01| 1|
2020-05-02| 2|
2020-05-03| 3|
2020-06-01| 4|
2020-06-02| 5|
2020-06-03| 6|
2020-07-01| 7|
2020-07-02| 8|
2020-07-03| 9|
2020-08-01| 10|
2020-08-02| 11|
2020-08-03| 12|
Calculating monthly_sales for reference:
quarter|month_name|amount|monthly_sales|
------- ---------- ------ -------------
2|May | 1| 6|
2|May | 2| 6|
2|May | 3| 6|
2|June | 4| 15|
2|June | 5| 15|
2|June | 6| 15|
3|July | 7| 24|
3|July | 8| 24|
3|July | 9| 24|
3|August | 10| 33|
3|August | 11| 33|
3|August | 12| 33|
- apply
max(sum())
and everything looks good.
quarter|month_name|monthly_sales|max_overall_amount|max_quarter_sales|
------- ---------- ------------- ------------------ -----------------
2|May | 6| 33| 15|
2|June | 15| 33| 15|
3|July | 24| 33| 33|
3|August | 33| 33| 33|
- apply
sum()
only. It seems that only first row of each month is picked (I tried use CTE to order bypayment_date
ASC
orDESC
but no difference)
quarter|month_name|monthly_sales|sum_overall_amount|sum_quarter_sales|
------- ---------- ------------- ------------------ -----------------
2|May | 6| 22| 5|
2|June | 15| 22| 5|
3|July | 24| 22| 17|
3|August | 33| 22| 17|
-- 22 = 1 4 7 10 --> first row of each month
-- 5 = 1 4 --> first row of May, June
-- 17 = 7 10 --> first row of July, August
query against payment2
:
quarter|month_name|monthly_sales|sum_overall_amount|sum_quarter_sales|
------- ---------- ------------- ------------------ -----------------
2|May | 6| 30| 9|
2|June | 15| 30| 9|
3|July | 24| 30| 21|
3|August | 33| 30| 21|
-- 30 = 3 6 9 12 --> first row of each month in `payment2`
- apply max().
quarter|month_name|monthly_sales|max_overall_amount|max_quarter_sales|
------- ---------- ------------- ------------------ -----------------
2|May | 6| 10| 4|
2|June | 15| 10| 4|
3|July | 24| 10| 10|
3|August | 33| 10| 10|
-- 4 = max(1, 4) --> first row of May , June
-- 10 = max(7, 10) --> first row of July, August
query against payment2
:
quarter|month_name|monthly_sales|max_overall_amount|max_quarter_sales|
------- ---------- ------------- ------------------ -----------------
2|May | 6| 12| 6|
2|June | 15| 12| 6|
3|July | 24| 12| 12|
3|August | 33| 12| 12|
-- 6 = max(3, 6) --> first row of May, Jun in `payment2`
-- 12 = max(9, 12) --> first row of July, August in `payment2`
- apply min():
quarter|month_name|monthly_sales|min_overall_amount|min_quarter_sales|
------- ---------- ------------- ------------------ -----------------
2|May | 6| 1| 1|
2|June | 15| 1| 1|
3|July | 24| 1| 7|
3|August | 33| 1| 7|
-- 1 = min(1, 4) --> first row of May, June
-- 7 = min(7, 10) --> first row of July, August
query against payment2
:
quarter|month_name|monthly_sales|min_overall_amount|min_quarter_sales|
------- ---------- ------------- ------------------ -----------------
2|May | 6| 3| 3|
2|June | 15| 3| 3|
3|July | 24| 3| 9|
3|August | 33| 3| 9|
-- 3 = min(3, 6) --> first row of May, June in `payment2`
-- 9 = min(9, 12) --> first row of July, August in `payment2`
Environment:
- MySQL 8.0.26 on Centos Stream 8
- Disabled
ONLY_FULL_GROUP_BY
in session.
Observations:
GROUP BY
quarter & month seems to make MySQL to pick the first row in physical location of each group (May, June, July, August) and then apply aggregate functions (sum(), max(), min())- data in
payment1
andpayment1
are identical but stored in opposite order and the same queries ran against the two tables returned different results. - data stored order in a table may change over time. Therefore, the outcome could be also changed.
- If the above are true, disable
ONLY_FULL_GROUP_BY
sql_mode
may be harmful due to inconsistent results.
A Question for @Robert: Could you please run similar experiments on your end to see what the outcome looks like? My environment is very limited. Thanks