I have a table named 'numExample' with only one field like this:
| number |
| ------ |
| 6 |
| 7 |
| 9 |
| 12 |
| 12 |
| 18 |
| 13 |
| 23 |
| 16 |
| 20 |
| 21 |
| 50 |
| 6 |
| ... |
| 400 |
| ... |
The ... means there can be more rows and the upper limit is uncertain. I need to count how many numbers fall into each interval, where the intervals are left-closed and right-open with a fixed step of 5, e.g. [0, 5), [5, 10). I wanted to use a CASE expression, but since the upper limit is uncertain I can't list all of the WHEN branches:
SELECT numInterval, count(DISTINCT number) AS count
FROM (
    SELECT
        number,
        CASE
            WHEN number >= 0  AND number < 5  THEN '[0, 5)'
            WHEN number >= 5  AND number < 10 THEN '[5, 10)'
            WHEN number >= 10 AND number < 15 THEN '[10, 15)'
            ...
        END AS numInterval
    FROM numExample
)
GROUP BY numInterval;
Either Spark SQL or PySpark code is acceptable. Any help is much appreciated.
CodePudding user response:
- Calculate the left edge of each interval: floor(number / 5) * 5
- Group by that edge: GROUP BY intervalLeft
- Format the group label from the left edge: concat("[", intervalLeft, ",", intervalLeft + 5, ")")
Here is the example code:
SELECT
    intervalLeft,
    -- Build the "[left, left+5)" label from the interval's left edge
    concat("[", intervalLeft, ",", intervalLeft + 5, ")") AS numInterval,
    count(DISTINCT number) AS count
FROM (
    -- Map each number to the left edge of its 5-wide interval
    SELECT number, floor(number / 5) * 5 AS intervalLeft
    FROM numExample
)
GROUP BY intervalLeft;
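Since PySpark is also acceptable, here is a minimal DataFrame API sketch of the same approach. It assumes numExample is available as a table or temp view and that a SparkSession already exists; variable names such as binned are only for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumes numExample is registered as a table or temp view.
df = spark.table("numExample")

# Left edge of each 5-wide interval: floor(number / 5) * 5.
binned = df.withColumn(
    "intervalLeft", (F.floor(F.col("number") / 5) * 5).cast("int")
)

result = (
    binned.groupBy("intervalLeft")
    # Same aggregation as the SQL version: distinct numbers per interval.
    .agg(F.countDistinct("number").alias("count"))
    # Build the "[left, left+5)" label from the interval's left edge.
    .withColumn(
        "numInterval",
        F.concat(
            F.lit("["),
            F.col("intervalLeft").cast("string"),
            F.lit(","),
            (F.col("intervalLeft") + 5).cast("string"),
            F.lit(")"),
        ),
    )
    .orderBy("intervalLeft")
    .select("numInterval", "count")
)

result.show()

The orderBy is optional; it just makes the histogram print in interval order.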
Hope this helps.