How to group by multiple dimensions when `group by` is written only once in Spark SQL?

Time: 01-11

I have a table named 'example' like this:

date num
2023-01-01 10
2023-01-01 10
2023-01-02 20
2023-01-03 30
2023-01-03 30
2023-01-04 40
2023-01-05 50
2023-01-06 60
2023-01-07 70
2023-01-07 70
2023-01-08 80
2023-01-09 90
2023-01-10 100

I want to group the date field by day, by week, and by month.

My Spark SQL code:

(
    SELECT date, 'day' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY date
    ORDER BY date
)
UNION ALL
(
    SELECT trunc(date, 'WEEK') AS date, 'week' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY trunc(date, 'WEEK')
    ORDER BY trunc(date, 'WEEK')
)
UNION ALL
(
    SELECT trunc(date, 'MONTH') AS date, 'month' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY trunc(date, 'MONTH')
    ORDER BY trunc(date, 'MONTH')
);

Result:

date dateType sum
2023-01-01 day 20
2023-01-02 day 20
2023-01-03 day 60
2023-01-04 day 40
2023-01-05 day 50
2023-01-06 day 60
2023-01-07 day 140
2023-01-08 day 80
2023-01-09 day 90
2023-01-10 day 100
2022-12-26 week 20
2023-01-02 week 450
2023-01-09 week 190
2023-01-01 month 660
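
The week bucket 2022-12-26 in the result may look surprising: 2023-01-01 is a Sunday, and Spark's `trunc(date, 'WEEK')` truncates to the Monday on or before the given date. A plain-Python sketch of the truncation semantics (not Spark code, just an illustration of what the two `trunc` calls compute):

```python
from datetime import date, timedelta

def trunc_week(d):
    # trunc(date, 'WEEK') in Spark returns the Monday on or before d.
    return d - timedelta(days=d.weekday())

def trunc_month(d):
    # trunc(date, 'MONTH') returns the first day of d's month.
    return d.replace(day=1)

print(trunc_week(date(2023, 1, 1)))    # 2022-12-26 (2023-01-01 is a Sunday)
print(trunc_week(date(2023, 1, 2)))    # 2023-01-02 (a Monday truncates to itself)
print(trunc_month(date(2023, 1, 10)))  # 2023-01-01
```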

Duplicate code is the root of all evil in software design. When a system is littered with many snippets of identical, or nearly identical code, it is indicative of sloppiness, carelessness, and sheer unprofessionalism. It is the gilt-edged responsibility of all software developers to root out and eliminate duplication whenever they find it.

I wrote almost the same code three times; only the date expression differs, which does not conform to the principles of clean code. So I want to remove the UNION ALL statements and use GROUP BY only once.

Question: Is there a way to write GROUP BY only once, minimize duplicate code, and generate the above result?

The final code would look like:

SELECT xxx AS date, 'xxx' AS dateType, sum(num) AS sum
FROM example
GROUP BY xxx
ORDER BY xxx;

Both Spark SQL code and PySpark code are acceptable. Thank you in advance.

CodePudding user response:

Your code actually works fine, I think. You can also do this:

SELECT date, dateType, sum(num) AS sum FROM
(
    SELECT date, 'day' AS dateType, num
    FROM example

    UNION ALL

    SELECT trunc(date, 'WEEK') AS date, 'week' AS dateType, num
    FROM example

    UNION ALL

    SELECT trunc(date, 'MONTH') AS date, 'month' AS dateType, num
    FROM example
) t 
GROUP BY date, dateType
ORDER BY date, dateType

You only have one GROUP BY in this case, but you cannot get rid of the UNION ALL.
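
To see why this produces the same numbers, here is a plain-Python simulation of the query's two stages on the question's example data (not Spark code; the week truncation is assumed Monday-start, as Spark's `trunc` uses):

```python
from collections import defaultdict
from datetime import date, timedelta

# The example table from the question as (date, num) pairs.
rows = [
    (date(2023, 1, 1), 10), (date(2023, 1, 1), 10),
    (date(2023, 1, 2), 20), (date(2023, 1, 3), 30),
    (date(2023, 1, 3), 30), (date(2023, 1, 4), 40),
    (date(2023, 1, 5), 50), (date(2023, 1, 6), 60),
    (date(2023, 1, 7), 70), (date(2023, 1, 7), 70),
    (date(2023, 1, 8), 80), (date(2023, 1, 9), 90),
    (date(2023, 1, 10), 100),
]

def trunc_week(d):
    return d - timedelta(days=d.weekday())  # Monday on or before d

def trunc_month(d):
    return d.replace(day=1)  # first day of the month

# Stage 1, the inner UNION ALL: tag every source row with its bucket
# and dateType, without aggregating yet.
unioned = [(d, "day", num) for d, num in rows]
unioned += [(trunc_week(d), "week", num) for d, num in rows]
unioned += [(trunc_month(d), "month", num) for d, num in rows]

# Stage 2, the single outer GROUP BY date, dateType.
sums = defaultdict(int)
for d, date_type, num in unioned:
    sums[(d, date_type)] += num

print(sums[(date(2023, 1, 1), "month")])   # 660
print(sums[(date(2022, 12, 26), "week")])  # 20
```

The union triples the number of rows before aggregation, which is why this rewrite changes only the shape of the code, not the amount of work.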

CodePudding user response:

Groups the rows for each grouping set specified after GROUPING SETS. For example, GROUP BY GROUPING SETS ((warehouse), (product)) is semantically equivalent to union of results of GROUP BY warehouse and GROUP BY product. This clause is a shorthand for a UNION ALL where each leg of the UNION ALL operator performs aggregation of each grouping set specified in the GROUPING SETS clause. Similarly, GROUP BY GROUPING SETS ((warehouse, product), (product), ()) is semantically equivalent to the union of results of GROUP BY warehouse, product, GROUP BY product and global aggregate.

(From the GROUPING SETS parameter description in the Spark SQL GROUP BY clause documentation.)

I used the GROUPING SETS syntax to reduce the redundant code, although it doesn't look perfect. It is worth mentioning that GROUP BY GROUPING SETS is semantically equivalent to the UNION ALL version, so there is no performance optimization here, just less redundant code; the outer SELECT is still somewhat clumsy.

SELECT
    coalesce(date, week, month) AS date,
    CASE
        WHEN date IS NOT NULL THEN 'day'
        WHEN week IS NOT NULL THEN 'week'
        WHEN month IS NOT NULL THEN 'month'
    END AS dateType,
    sum
FROM (
    SELECT date, trunc(date, 'WEEK') AS week, trunc(date, 'MONTH') AS month, sum(num) AS sum
    FROM example
    GROUP BY GROUPING SETS(date, trunc(date, 'WEEK'), trunc(date, 'MONTH'))
)
ORDER BY dateType, date;
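
The key behavior this query relies on is that, within each grouping set, grouping expressions not in the current set come back as NULL, which is what `coalesce` and the CASE expression then exploit. A plain-Python sketch of those semantics on the example data (not Spark code; None stands in for NULL, and Monday-start weeks are assumed, as in Spark's `trunc`):

```python
from collections import defaultdict
from datetime import date, timedelta

# The example table from the question as (date, num) pairs.
rows = [
    (date(2023, 1, 1), 10), (date(2023, 1, 1), 10),
    (date(2023, 1, 2), 20), (date(2023, 1, 3), 30),
    (date(2023, 1, 3), 30), (date(2023, 1, 4), 40),
    (date(2023, 1, 5), 50), (date(2023, 1, 6), 60),
    (date(2023, 1, 7), 70), (date(2023, 1, 7), 70),
    (date(2023, 1, 8), 80), (date(2023, 1, 9), 90),
    (date(2023, 1, 10), 100),
]

def trunc_week(d):
    return d - timedelta(days=d.weekday())  # Monday on or before d

def trunc_month(d):
    return d.replace(day=1)  # first day of the month

# Each grouping set aggregates separately; expressions not in the current
# set are NULL (None here), mirroring
# GROUPING SETS(date, trunc(date, 'WEEK'), trunc(date, 'MONTH')).
agg = defaultdict(int)
for d, num in rows:
    agg[(d, None, None)] += num               # set: (date)
    agg[(None, trunc_week(d), None)] += num   # set: (trunc(date, 'WEEK'))
    agg[(None, None, trunc_month(d))] += num  # set: (trunc(date, 'MONTH'))

# The outer SELECT: coalesce(date, week, month), plus the CASE on
# whichever key is non-NULL.
result = []
for (day, week, month), total in agg.items():
    if day is not None:
        result.append((day, "day", total))
    elif week is not None:
        result.append((week, "week", total))
    else:
        result.append((month, "month", total))

result.sort(key=lambda r: (r[1], r[0]))  # ORDER BY dateType, date
print(len(result))  # 14 rows: 10 day + 3 week + 1 month
```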