I have the following python dictionary:
ranges = {
(0, 10): '0 - 10',
(10, 100): '10 - 100',
(100, float('inf')): '100 '
}
And the following df:
Id | Value |
---|---|
001 | 9 |
002 | 10 |
003 | 300 |
I would like to add a column that returns the dictionary's value when the "Value" column falls between the two numbers in the dictionary's key (lower bound inclusive, upper bound exclusive).
So resulting df should look like this:
Id | Value | Range |
---|---|---|
001 | 9 | 0 - 10 |
002 | 10 | 10 - 100 |
003 | 300 | 100 |
I know I can use withColumn and when, e.g.:
from pyspark.sql.functions import col, lit, when

df.withColumn(
    'Range',
    when((col('Value') >= lit(0)) & (col('Value') < lit(10)), '0 - 10')
)
and so on, but this becomes unwieldy if I have hundreds of key-value pairs or want to test with different numbers in the keys.
Hopefully this makes sense. I would appreciate any and all inputs. Thank you so much in advance.
CodePudding user response:
The dictionary can be transformed into an SQL expression:
range_expr = 'case '
for r, label in ranges.items():
    range_expr += f'when value >= {r[0]} '
    if r[1] != float('inf'):
        range_expr += f'and value < {r[1]} '
    range_expr += f'then "{label}" '
range_expr += 'end'
range_expr now has the value
case when value >= 0 and value < 10 then "0 - 10" when value >= 10 and value < 100 then "10 - 100" when value >= 100 then "100 " end
and can be used to construct a new column:
from pyspark.sql import functions as F
df.withColumn('range', F.expr(range_expr)).show()
Output:
+---+-----+--------+
| id|value|   range|
+---+-----+--------+
|001|    9|  0 - 10|
|002|   10|10 - 100|
|003|  300|    100 |
+---+-----+--------+
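If you prefer to stay in the DataFrame API instead of generating SQL text, the same dictionary can also drive a chained when expression built in a loop. This is only a sketch, not part of the original answer; it assumes an active SparkSession and the small example frame from the question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('001', 9), ('002', 10), ('003', 300)], ['Id', 'Value'])

ranges = {
    (0, 10): '0 - 10',
    (10, 100): '10 - 100',
    (100, float('inf')): '100 ',
}

# Fold the dictionary into a single chained when(...) column.
range_col = None
for (lo, hi), label in ranges.items():
    cond = F.col('Value') >= F.lit(lo)
    if hi != float('inf'):
        cond = cond & (F.col('Value') < F.lit(hi))
    range_col = F.when(cond, label) if range_col is None else range_col.when(cond, label)

df.withColumn('Range', range_col).show()

Values that match no range would get NULL; adding .otherwise(...) on range_col can supply a default label if needed.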