I am unable to convert a list into a PySpark DataFrame because the list contains null values. Here are my data list and column-names list:
row = [Row(run_id,
           None,
           self.args.config_id,
           'in-progress',
           'locked',
           datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
           datetime.now().strftime('%Y-%m-%d %H:%M:%S')
           )]
columns = ['RUN_ID', 'RE_RUN_ID', 'CONFIG_ID', 'JOB_STATUS', 'LOCK_STATUS', 'INSERT_DTS', 'UPDATE_DTS']
df = sc.spark.createDataFrame(row, columns)
df.show(truncate=False)
The type of the None value is not inferred, and the following error is raised:
ValueError: Some of types cannot be determined after inferring
When I replace the None with '' (empty quotes) it works, but then the value is an empty string rather than a NULL in the DataFrame. I verified this with the following code, which returns 0 for the flag:
df.withColumn("flag", fn.when(fn.col("RE_RUN_ID").isNull(), 1).otherwise(0)).show()
How to create a dataframe from a list containing None values?
CodePudding user response:
When a column contains only None values, Spark cannot infer a type for it. In that case the schema you supply shouldn't be a plain list of column names but an actual schema, for example a DDL string or a StructType:
row = [Row(run_id,
           None,
           self.args.config_id,
           'in-progress',
           'locked',
           datetime.now(),  # pass datetime objects, not strftime strings,
           datetime.now()   # so they match the timestamp columns below
           )]
columns = "run_id string, re_run_id string, config_id string, job_status string, lock_status string, insert_dts timestamp, update_dts timestamp"
df = sc.spark.createDataFrame(row, columns)
df.show(truncate=False)
By declaring the column types explicitly, the error no longer appears.
An alternative is to cast the None to a specific type: lit(None).cast("string") instead of a bare None will also do the trick. Note that lit() returns a Column, so this form belongs in DataFrame expressions such as withColumn or select, not inside the Row data itself.