I have a small PySpark DataFrame df:

index col1
0     1
1     3
2     4

And a dictionary:

LOOKUP = {0: 2, 1: 5, 2: 5, 3: 4, 4: 6}

I now want to add an extra column col2 to df, equal to the LOOKUP values of col1.
My output should look like this:

index col1 col2
0     1    5
1     3    4
2     4    6
I tried using:

df = df.withColumn(col("col2"), LOOKUP[col("col1")])

But this gave me errors, and so did using expr. How can I achieve this in PySpark?
CodePudding user response:
You can use a map column that you create from the lookup dictionary:
from itertools import chain
from pyspark.sql import functions as F

lookup = {0: 2, 1: 5, 2: 5, 3: 4, 4: 6}

# Flatten the dict into alternating key/value literals: key1, val1, key2, val2, ...
lookup_map = F.create_map(*[F.lit(x) for x in chain(*lookup.items())])

# Index the map column with col1 to fetch the mapped value.
df1 = df.withColumn("col2", lookup_map[F.col("col1")])
df1.show()
# +-----+----+----+
# |index|col1|col2|
# +-----+----+----+
# |    0|   1|   5|
# |    1|   3|   4|
# |    2|   4|   6|
# +-----+----+----+
Another way would be to create a lookup_df from the dict and then join it with your DataFrame, as sketched below.
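A minimal sketch of that join-based approach, assuming an active SparkSession named spark and the LOOKUP dictionary from the question:

# Turn the dict into a two-column DataFrame: one row per (key, value) pair.
lookup_df = spark.createDataFrame(list(LOOKUP.items()), ["col1", "col2"])

# A left join keeps every row of df, leaving col2 null when col1 has no match.
df2 = df.join(lookup_df, on="col1", how="left").select("index", "col1", "col2")
df2.show()

A join scales better than a literal map or CASE expression when the lookup table is large, at the cost of a shuffle for small ones.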
CodePudding user response:
You can use a CASE WHEN statement built with Python f-strings from the LOOKUP dictionary:
from pyspark.sql import functions as F

column = 'col1'  # column to map through the lookup

# Build one WHEN clause per dictionary entry, e.g. WHEN col1='1' THEN '5'
e = f"""CASE {' '.join([f"WHEN {column}='{k}' THEN '{v}'" for k, v in LOOKUP.items()])}
ELSE NULL END"""

out = df.withColumn("col2", F.expr(e))
out.show()
+-----+----+----+
|index|col1|col2|
+-----+----+----+
|    0|   1|   5|
|    1|   3|   4|
|    2|   4|   6|
+-----+----+----+
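Note that because the THEN '{v}' values are quoted in the expression, col2 comes out as a string column; if you need an integer, you can cast it afterwards:

# Cast the string result of the CASE expression back to an integer.
out = out.withColumn("col2", F.col("col2").cast("int"))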