Below is the input dataframe.
+-----------+---+------+----+----+
|       DATE| ID|   sal| vat|flag|
+-----------+---+------+----+----+
|10-may-2022|  1|1000.0|12.0|   1|
|12-may-2022|  2|  50.0| 6.0|   1|
+-----------+---+------+----+----+
I want to perform the operation below based on the flag column.
If the flag column is 1, I want to do the following:
df = srcdf.withColumn("sum",col("sal")*2)
display(df)
If the flag column is 2, I want to do the following:
df = srcdf.withColumn("sum",col("sal")*4)
display(df)
Below is the code I'm using:
flag = srcdf.select(col("flag"))
if flag == 1:
    df = srcdf.withColumn("sum", col("sal") * 2)
    display(df)
else:
    df = srcdf.withColumn("sum", col("sal") * 4)
    display(df)
When I use the above, I get a syntax error. Is there any other way I can achieve this using PySpark conditional statements?
Thank you.
CodePudding user response:
Possible duplicate of this question.
You need to use when, with (or without) otherwise, from pyspark.sql.functions.
from pyspark.sql.functions import when, col

df = srcdf.withColumn(
    "sum",
    when(col("flag") == 1, col("sal") * 2)
    .when(col("flag") == 2, col("sal") * 4)
)
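Note that with this first version, rows whose flag is neither 1 nor 2 get a null sum, because no otherwise branch is supplied.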
OR
from pyspark.sql.functions import when, col

df = srcdf.withColumn(
    "sum",
    when(col("flag") == 1, col("sal") * 2)
    .otherwise(col("sal") * 4)
)
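For completeness, here is a minimal, self-contained sketch of the second option. It assumes a local SparkSession and the column names from the dataframe in the question; display() is a Databricks/notebook helper, so df.show() is used here instead.
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

# Sample data matching the question's dataframe (column names assumed from the table above)
srcdf = spark.createDataFrame(
    [("10-may-2022", 1, 1000.0, 12.0, 1),
     ("12-may-2022", 2, 50.0, 6.0, 1)],
    ["DATE", "ID", "sal", "vat", "flag"],
)

# Per-row conditional: flag == 1 doubles sal, anything else quadruples it
df = srcdf.withColumn(
    "sum",
    when(col("flag") == 1, col("sal") * 2).otherwise(col("sal") * 4),
)

df.show()  # sum is 2000.0 and 100.0 for the two sample rows; use display(df) in a Databricks notebook
If the flag really is a single value that is constant across the whole dataframe, you could also pull it to the driver first (for example flag = srcdf.first()["flag"]) and keep your original if/else, but the when/otherwise expression avoids collecting to the driver and handles mixed flag values per row.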