I have a dataset with columns V1, V2, V3, V4, ..., V200,
and I would like to return a table with the column name and how many NULL,
zero, and below-zero values each column has. My current code looks like:
SELECT 'V1' AS column_name,
       SUM(CASE WHEN V1 IS NULL THEN 1 ELSE 0 END) AS n_null,
       SUM(CASE WHEN V1 = 0 THEN 1 ELSE 0 END) AS n_zero,
       SUM(CASE WHEN V1 < 0 THEN 1 ELSE 0 END) AS n_below_zero
FROM mytable
UNION ALL
...
SELECT 'V200' AS column_name,
       SUM(CASE WHEN V200 IS NULL THEN 1 ELSE 0 END) AS n_null,
       SUM(CASE WHEN V200 = 0 THEN 1 ELSE 0 END) AS n_zero,
       SUM(CASE WHEN V200 < 0 THEN 1 ELSE 0 END) AS n_below_zero
FROM mytable
Is there a faster way than this? I feel that 200 UNION ALLs
is not the fastest approach.
I am running on Databricks, so Spark SQL.
CodePudding user response:
It should be way faster to do one full table scan and aggregate all rows into a single row containing all the counts. Here is how to do that:
SELECT
  SUM(CASE WHEN V1 IS NULL THEN 1 ELSE 0 END) AS v1_null_cnt,
  SUM(CASE WHEN V1 = 0 THEN 1 ELSE 0 END) AS v1_zero_cnt,
  SUM(CASE WHEN V1 < 0 THEN 1 ELSE 0 END) AS v1_nega_cnt,
  SUM(CASE WHEN V2 IS NULL THEN 1 ELSE 0 END) AS v2_null_cnt,
  SUM(CASE WHEN V2 = 0 THEN 1 ELSE 0 END) AS v2_zero_cnt,
  SUM(CASE WHEN V2 < 0 THEN 1 ELSE 0 END) AS v2_nega_cnt,
  ...
  SUM(CASE WHEN V200 < 0 THEN 1 ELSE 0 END) AS v200_nega_cnt
FROM mytable;
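Writing 600 aggregate expressions by hand is as tedious as writing 200 UNION ALLs, so you will probably want to generate the SELECT list programmatically. A minimal Python sketch, assuming the columns are named V1..V200 and the table is called mytable as in the question:

```python
# Build the one-scan aggregation query instead of typing it out.
# Column names V1..V200 and the table name "mytable" come from the
# question; adjust both for your actual schema.
cols = [f"V{i}" for i in range(1, 201)]

exprs = []
for c in cols:
    exprs.append(f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c.lower()}_null_cnt")
    exprs.append(f"SUM(CASE WHEN {c} = 0 THEN 1 ELSE 0 END) AS {c.lower()}_zero_cnt")
    exprs.append(f"SUM(CASE WHEN {c} < 0 THEN 1 ELSE 0 END) AS {c.lower()}_nega_cnt")

query = "SELECT\n  " + ",\n  ".join(exprs) + "\nFROM mytable"
print(query[:200])  # on Databricks, run it with spark.sql(query)
```

On Databricks you could equally build this from `df.columns` in a notebook cell; the point is that one generated statement scans the table once.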
Once you have this single result row, you can unpivot it to get one row per table column, if you prefer that shape.
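For the unpivot step, Spark SQL's `stack` generator turns the one wide row into one row per source column (newer Databricks runtimes also offer an UNPIVOT clause). Since listing 200 column groups by hand has the same problem as before, here is a hedged Python sketch that generates such a query; the alias pattern matches the query above, and `counts` is an assumed name for the one-row aggregation result (e.g. a CTE or temp view):

```python
# Generate an unpivot query using stack(n, ...): it emits n rows, each
# taking the next group of expressions as its column values. Here each
# group is (label, null count, zero count, negative count) for one of
# the original V1..V200 columns.
cols = [f"V{i}" for i in range(1, 201)]

stack_args = ",\n    ".join(
    f"'{c}', {c.lower()}_null_cnt, {c.lower()}_zero_cnt, {c.lower()}_nega_cnt"
    for c in cols
)

# "counts" is a placeholder for wherever you stored the one-row result.
unpivot_sql = (
    "SELECT stack(200,\n    " + stack_args +
    "\n) AS (column_name, n_null, n_zero, n_below_zero)\n"
    "FROM counts"
)
print(unpivot_sql[:160])
```

The resulting table has the same shape as the original UNION ALL approach (column_name, n_null, n_zero, n_below_zero), but it is produced from a single table scan plus one cheap reshape of a one-row result.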