How to use a column value as key to a dictionary in PySpark?

I have a small PySpark DataFrame df:

index    col1
0        1    
1        3
2        4

And a dictionary:

LOOKUP = {0: 2, 1: 5, 2: 5, 3: 4, 4: 6}

I now want to add a column col2 to df that contains the LOOKUP value for each entry in col1.

My output should look like this:

index    col1 col2
0        1    5    
1        3    4
2        4    6

I tried using:

df = df.withColumn(col("col2"), LOOKUP[col("col1")])

But this gave me errors, and so did a version using expr.

How can I achieve this in PySpark?

CodePudding user response:

You can use a map column that you create from the lookup dictionary:

from itertools import chain
from pyspark.sql import functions as F

LOOKUP = {0: 2, 1: 5, 2: 5, 3: 4, 4: 6}
# Flatten the dict into alternating key/value literals and build a map column
lookup_map = F.create_map(*[F.lit(x) for x in chain(*LOOKUP.items())])

df1 = df.withColumn("col2", lookup_map[F.col("col1")])

df1.show()
# +-----+----+----+
# |index|col1|col2|
# +-----+----+----+
# |    0|   1|   5|
# |    1|   3|   4|
# |    2|   4|   6|
# +-----+----+----+
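
A map lookup returns null for keys that are missing from the map. If you want a fallback value instead, you could wrap the lookup in F.coalesce (the default of -1 below is just an illustrative choice):

df1 = df.withColumn(
    "col2",
    F.coalesce(lookup_map[F.col("col1")], F.lit(-1)),  # -1 when col1 has no entry in the map
)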

Another way would be to create a lookup_df from the dict and then join it with your DataFrame:
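
Here is a minimal sketch of that approach, assuming an active SparkSession named spark (lookup_df and the column names key and value are illustrative):

# Build a two-column DataFrame from the dict
lookup_df = spark.createDataFrame(list(LOOKUP.items()), ["key", "value"])

# Left join so rows of df whose col1 has no match keep a null col2
df2 = (
    df.join(lookup_df, df["col1"] == lookup_df["key"], "left")
      .select("index", "col1", F.col("value").alias("col2"))
)

For a dict this small, wrapping lookup_df in F.broadcast avoids shuffling df during the join.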

CodePudding user response:

You can use a CASE WHEN statement built with Python f-strings from the LOOKUP dictionary:

from pyspark.sql import functions as F
column = "col1"  # column whose values are looked up
e = f"""CASE {' '.join([f"WHEN {column}='{k}' THEN '{v}'" for k, v in LOOKUP.items()])}
        ELSE NULL END"""
out = df.withColumn("col2", F.expr(e))

out.show()

+-----+----+----+
|index|col1|col2|
+-----+----+----+
|    0|   1|   5|
|    1|   3|   4|
|    2|   4|   6|
+-----+----+----+
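
Because both the keys and the values are quoted in the generated SQL, col2 comes back as a string column. If you need it as an integer, you can cast it afterwards:

out = out.withColumn("col2", F.col("col2").cast("int"))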