Home > Net >  Replacing UNION ALL to increase speed when summarizing multiple columns
Replacing UNION ALL to increase speed when summarizing multiple columns

Time:09-08

I have a dataset with columns V1 V2 V3 V4 ... V200 where I would like to return a table with the column name and how many NULL, zero and below zero values each column has. My Current code looks like:

SELECT 'V1' AS column_name, SUM(CASE WHEN V1 IS NULL THEN 1 ELSE 0 END) AS n_null, SUM(...V1 = 0) AS n_zero, SUM(... V1 < 0) AS n_below_zero UNION ALL
...
SELECT 'V200' AS column_name, SUM(CASE WHEN V200 IS NULL THEN 1 ELSE 0 END) AS n_null, SUM(...V200 = 0) AS n_zero, SUM(... V200 < 0) AS n_below_zero 

Is there a faster way than this? I feel that 200 UNION ALL is not the fastest way

I am running on Databricks, so Spark SQL.

CodePudding user response:

It should be way faster to have one full table scan and aggregate all rows to one row containing all counts. Here is how to do that:

SELECT 
  SUM(CASE WHEN V1   IS NULL THEN 1 ELSE 0 END) AS v1_null_cnt,
  SUM(CASE WHEN V1   = 0     THEN 1 ELSE 0 END) AS v1_zero_cnt,
  SUM(CASE WHEN V1   < 0     THEN 1 ELSE 0 END) AS v1_nega_cnt,
  SUM(CASE WHEN V2   IS NULL THEN 1 ELSE 0 END) AS v2_null_cnt,
  SUM(CASE WHEN V2   = 0     THEN 1 ELSE 0 END) AS v2_zero_cnt,
  SUM(CASE WHEN V2   < 0     THEN 1 ELSE 0 END) AS v2_nega_cnt,
  ...
  SUM(CASE WHEN V200 < 0     THEN 1 ELSE 0 END) AS v200_nega_cnt
FROM mytable;

Once you have this result row, you can unpivot it to get one row per table column, if you prefer this.

  • Related