I am unable to convert a list into a PySpark DataFrame because the list contains null values. Here are my data list and column-names list:
row = [Row(run_id,
           None,
           self.args.config_id,
           'in-progress',
           'locked',
           datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
           datetime.now().strftime('%Y-%m-%d %H:%M:%S')
           )]
columns = ['RUN_ID', 'RE_RUN_ID', 'CONFIG_ID', 'JOB_STATUS', 'LOCK_STATUS', 'INSERT_DTS', 'UPDATE_DTS']
df = sc.spark.createDataFrame(row, columns)
df.show(truncate=False)
The type of the None value is not inferred, and the following error is raised:
ValueError: Some of types cannot be determined after inferring
When I replace the None with '' (empty quotes) it works, but then the value is an empty string rather than a NULL in the DataFrame. I verified this with the following code, which returns 0 for the flag:
df.withColumn("flag", fn.when(fn.col("RE_RUN_ID").isNull(), 1).otherwise(0)).show()
How to create a dataframe from a list containing None values?
CodePudding user response:
When a column contains only None values, Spark cannot infer a type for it. In that case the schema you supply shouldn't be a plain list of column names but an actual schema, for example a DDL string or a StructType:
row = [Row(run_id,
           None,
           self.args.config_id,
           'in-progress',
           'locked',
           datetime.now(),  # pass datetime objects, not strftime strings,
           datetime.now()   # so they match the timestamp columns below
           )]
columns = "run_id string, re_run_id string, config_id string, job_status string, lock_status string, insert_dts timestamp, update_dts timestamp"
df = sc.spark.createDataFrame(row, columns)
df.show(truncate=False)
By declaring the column types explicitly, the error no longer appears.
An alternative is to cast the None to a specific type: lit(None).cast("string") instead of a bare None will also do the trick. Note that lit() returns a Column, so this form belongs in DataFrame expressions such as withColumn or select, not inside the Row data itself.