I have a table named 'example' like this:
date | num |
---|---|
2023-01-01 | 10 |
2023-01-01 | 10 |
2023-01-02 | 20 |
2023-01-03 | 30 |
2023-01-03 | 30 |
2023-01-04 | 40 |
2023-01-05 | 50 |
2023-01-06 | 60 |
2023-01-07 | 70 |
2023-01-07 | 70 |
2023-01-08 | 80 |
2023-01-09 | 90 |
2023-01-10 | 100 |
I want to group date field by day, week and month.
My Spark SQL code:
(
SELECT date, 'day' AS dateType, sum(num) AS sum
FROM example
GROUP BY date
ORDER BY date
)
UNION ALL
(
SELECT trunc(date, 'WEEK') AS date, 'week' AS dateType, sum(num) AS sum
FROM example
GROUP BY trunc(date, 'WEEK')
ORDER BY trunc(date, 'WEEK')
)
UNION ALL
(
SELECT trunc(date, 'MONTH') AS date, 'month' AS dateType, sum(num) AS sum
FROM example
GROUP BY trunc(date, 'MONTH')
ORDER BY trunc(date, 'MONTH')
);
Result:
date | dateType | sum |
---|---|---|
2023-01-01 | day | 20 |
2023-01-02 | day | 20 |
2023-01-03 | day | 60 |
2023-01-04 | day | 40 |
2023-01-05 | day | 50 |
2023-01-06 | day | 60 |
2023-01-07 | day | 140 |
2023-01-08 | day | 80 |
2023-01-09 | day | 90 |
2023-01-10 | day | 100 |
2022-12-26 | week | 20 |
2023-01-02 | week | 450 |
2023-01-09 | week | 190 |
2023-01-01 | month | 660 |
Duplicate code is the root of all evil in software design. When a system is littered with many snippets of indentical, or nearly identical code, it is indicative of sloppiness, carelessness, and sheer unprofessionalism. It is the guilt-edged responsibility of all software developers to root out and eliminate duplication whenever they find it.
I wrote almost the same code three times, but the date field is different, which does not conform to the principle of clean code. So i want to remove the union all
statement and use group by
only once.
Question: Is there a way to write group by
only once and minimize duplicate code and generate the above result?
The final code like:
SELECT xxx AS date, 'xxx' AS dateType, sum(num) AS sum
FROM example
GROUP BY xxx
ORDER BY xxx;
Both Spark SQL code and PySpark code are acceptable. Thank you in advance.
CodePudding user response:
Your code actually works fine I think. You can also do this :
select date, dateType, sum(num) AS sum from
(
SELECT date, 'day' AS dateType, num
FROM example
UNION ALL
SELECT trunc(date, 'WEEK') AS date, 'week' AS dateType, num
FROM example
UNION ALL
SELECT trunc(date, 'MONTH') AS date, 'month' AS dateType, num
FROM example
) t
GROUP BY date, dateType
ORDER BY date, dateType
You only have one group by
un this case but you cannot get rid of the union all
CodePudding user response:
Groups the rows for each grouping set specified after GROUPING SETS. For example, GROUP BY GROUPING SETS ((warehouse), (product)) is semantically equivalent to union of results of GROUP BY warehouse and GROUP BY product. This clause is a shorthand for a UNION ALL where each leg of the UNION ALL operator performs aggregation of each grouping set specified in the GROUPING SETS clause. Similarly, GROUP BY GROUPING SETS ((warehouse, product), (product), ()) is semantically equivalent to the union of results of GROUP BY warehouse, product, GROUP BY product and global aggregate.
grouping set parameter in group by clause.
I use GROUPING SETS syntax
to try to reduce redundant code, although it doesn't look perfect. It is worth mentioning that GROUP BY GROUPING SETS
is semantically equivalent to union
. Therefore, there is no performance optimization here, just reducing redundant code, and there are still inappropriate SELECT
operations.
SELECT
coalesce(date, week, month) AS date,
CASE
WHEN date IS NOT NULL THEN 'day'
WHEN week IS NOT NULL THEN 'week'
WHEN month IS NOT NULL THEN 'month'
END AS dateType,
sum
FROM (
SELECT date, trunc(date, 'WEEK') AS week, trunc(date, 'MONTH') AS month, sum(num) AS sum
FROM example
GROUP BY GROUPING SETS(date, trunc(date, 'WEEK'), trunc(date, 'MONTH'))
)
ORDER BY dateType, date;