Add PySpark column based on dictionary where the keys are tuples


I have the following Python dictionary:

ranges = {
    (0, 10): '0 - 10',
    (10, 100): '10 - 100',
    (100, float('inf')): '100 '
}

And the following df:

Id Value
001 9
002 10
003 300
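
For anyone who wants to reproduce this locally, here is a minimal sketch of how the df can be created (it assumes a SparkSession named spark already exists):

# Hypothetical setup code, not part of the original question
df = spark.createDataFrame(
    [('001', 9), ('002', 10), ('003', 300)],
    ['Id', 'Value']
)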

I would like to add a column that returns the dictionary's value when the "Value" column falls within the key's range, i.e. greater than or equal to the lower number and strictly less than the upper number.

So the resulting df should look like this:

Id Value Range
001 9 0 - 10
002 10 10 - 100
003 300 100

I know I can use withColumn and when, e.g.:

from pyspark.sql.functions import col, lit, when

df.withColumn(
    'Range',
    when((col('Value') >= lit(0)) & (col('Value') < lit(10)), '0 - 10')
)

and so on, but this would be impractical if I had hundreds of key-value pairs or wanted to test with different numbers in the keys.

Hopefully this makes sense. I would appreciate any and all inputs. Thank you so much in advance.

CodePudding user response:

The dictionary can be transformed into an SQL expression:

# Build a CASE WHEN expression from the ranges dictionary
range_expr = 'case '
for r, label in ranges.items():
    range_expr += f'when value >= {r[0]} '
    if r[1] != float('inf'):
        range_expr += f'and value < {r[1]} '
    range_expr += f'then "{label}" '
range_expr += 'end'

range_expr now has the value

case when value >= 0 and value < 10 then "0 - 10" when value >= 10 and value < 100 then "10 - 100" when value >= 100 then "100 " end

and can be used to construct a new column:

from pyspark.sql import functions as F

df.withColumn('range', F.expr(range_expr)).show()

Output:

+---+-----+--------+
| id|value|   range|
+---+-----+--------+
|001|    9|  0 - 10|
|002|   10|10 - 100|
|003|  300|    100 |
+---+-----+--------+
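
As an aside (a sketch of an equivalent approach, not part of the original answer): the same dictionary can also be folded into one chained when(...) column with the DataFrame API, which avoids building an SQL string. It assumes df and ranges are defined as above:

from functools import reduce
from pyspark.sql import functions as F

# Build (condition, label) pairs from the same `ranges` dictionary.
# float('inf') needs no special case: every finite value is < infinity.
conditions = [
    ((F.col('value') >= lo) & (F.col('value') < hi), label)
    for (lo, hi), label in ranges.items()
]

# Fold the pairs into one chained when(...).when(...) column expression.
range_col = reduce(
    lambda acc, cond_label: acc.when(cond_label[0], cond_label[1]),
    conditions[1:],
    F.when(conditions[0][0], conditions[0][1]),
)

df.withColumn('range', range_col).show()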