I have a table with 4 columns: "ID", "FLAG_A", "FLAG_B", "FLAG_C". This is the SQL query I want to translate into PySpark; it updates both "FLAG_A" and "FLAG_B" wherever two conditions are met. How can I do this in PySpark?
UPDATE STATUS_TABLE SET STATUS_TABLE.[FLAG_A] = "JAVA",
STATUS_TABLE.FLAG_B = "PYTHON"
WHERE (((STATUS_TABLE.[FLAG_A])="PROFESSIONAL_CODERS") AND
((STATUS_TABLE.FLAG_C) Is Null));
Is it possible to do this in a single PySpark statement that applies the two conditions and updates both the "FLAG_A" and "FLAG_B" columns?
CodePudding user response:
I can't think of a way to rewrite this as the single statement you had in mind. I tried running the UPDATE query inside Spark, but it seems UPDATE is not supported:
: java.lang.UnsupportedOperationException: UPDATE TABLE is not supported temporarily.
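For reference, the attempt looked roughly like this (a sketch, assuming the table is registered under the name STATUS_TABLE; on my Spark version this statement is what raised the exception above):
# Hypothetical reproduction: running the UPDATE through Spark SQL.
spark.sql("""
    UPDATE STATUS_TABLE
    SET FLAG_A = 'JAVA', FLAG_B = 'PYTHON'
    WHERE FLAG_A = 'PROFESSIONAL_CODERS' AND FLAG_C IS NULL
""")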
The following does exactly the same as your UPDATE query:
Input:
from pyspark.sql import functions as F
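# Row 1 matches both conditions; row 2 must stay unchanged (its FLAG_C is not null).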
df = spark.createDataFrame(
[(1, 'PROFESSIONAL_CODERS', 'X', None),
(2, 'KEEP', 'KEEP', 'KEEP')],
['ID', 'FLAG_A', 'FLAG_B', 'FLAG_C'])
Script:
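# Shared condition, mirroring the SQL WHERE clause:
# FLAG_A = 'PROFESSIONAL_CODERS' AND FLAG_C IS NULL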
cond = (F.col('FLAG_A') == 'PROFESSIONAL_CODERS') & F.isnull('FLAG_C')
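# Update FLAG_B before FLAG_A, so cond still reads the original FLAG_A value.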
df = df.withColumn('FLAG_B', F.when(cond, 'PYTHON').otherwise(F.col('FLAG_B')))
df = df.withColumn('FLAG_A', F.when(cond, 'JAVA').otherwise(F.col('FLAG_A')))
df.show()
# +---+------+------+------+
# | ID|FLAG_A|FLAG_B|FLAG_C|
# +---+------+------+------+
# |  1|  JAVA|PYTHON|  null|
# |  2|  KEEP|  KEEP|  KEEP|
# +---+------+------+------+
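If you are on Spark 3.3+, the closest thing to a "single statement" is withColumns, which accepts a dict of column expressions. A sketch of the same logic (all expressions are evaluated against the original columns, so both when() calls see the old FLAG_A and the dict order does not matter):
cond = (F.col('FLAG_A') == 'PROFESSIONAL_CODERS') & F.isnull('FLAG_C')
df = df.withColumns({
    'FLAG_A': F.when(cond, 'JAVA').otherwise(F.col('FLAG_A')),
    'FLAG_B': F.when(cond, 'PYTHON').otherwise(F.col('FLAG_B')),
})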