I want to add a column to a data frame. Depending on whether a certain field appears in the source JSON, the value of the column should be the value from the source or null. My code looks like this:
withColumn("STATUS_BIT", expr("case when 'statusBit:' in jsonDF.schema.simpleString() then statusBit else None end"))
When I run this, I am getting "mismatched input ''statusBit:'' expecting {<EOF>, '-'}". Am I doing something wrong with the quotation marks? When I try
withColumn("STATUS_BIT", expr("case when \'statusBit:\' in jsonDF.schema.simpleString() then statusBit else None end"))
I get the exact same error. Trying the whole thing without expr, as a plain when, triggers the error "condition should be a Column". Running 'statusBit:' in jsonDF.schema.simpleString() by itself returns True with the test data I am using, but somehow I can't integrate it into the data frame transformation. Thanks a lot for your help in advance.
Edit: applying the solution provided by PLTC has helped a lot, but I am still struggling to implement it in the when clause. I try
.withColumn("STATUS_BIT", when(lit(df.schema.simplestring()).contains("statusBit") is True, col(statusBit)).otherwise(None))
but it tells me "condition should be a Column". So I added an extra column called SCHEMA, which is equal to lit(df.schema.simpleString()), and I used this column in the condition:
.withColumn("STATUS_BIT", when(col("SCHEMA").contains("statusBit"), col("StatusBit")).otherwise(None)
The problem is that if I run this with test data that does not contain "statusBit", I get the error "No such struct field statusBit in ...", which is obviously the opposite of what I wanted to achieve.
CodePudding user response:
jsonDF.schema.simpleString()
is a plain Python string evaluated on the driver; the SQL parser inside expr cannot evaluate Python expressions like in, so use it the Python way:
from pyspark.sql import functions as F
df.withColumn("STATUS_BIT", F.lit(df.schema.simpleString()).contains('statusBit:'))