Pyspark avoid _corrupt_record column while loading a list of jsons


I have a dataframe that loads data from a Python list of JSONs.

df = spark.read\
            .option("inferSchema","true")\
            .option("multiline",True)\
            .json(sc.parallelize(array_json))

The JSON structure can look like this:

{
  "name" : "Alex"
  "value" : 2,
  "tag" : {
    "property1" : "value1"
  }
}

But it can also look like this:

{
  "name" : "Robert"
  "value" : 2,
  "tag" : None
}

As you can see, the tag property can either be a JSON object or a null value.

The problem I'm having is that I get a column called _corrupt_record, populated only for the JSON documents that have tag set to None.

_corrupt_record                        | name | value | tag
null                                   | Alex | 2     | {"property1":"value1"}
{"name":"Robert","value":2,"tag":None  | null | null  | null

I want the dataframe to be like the following:

name   | value | tag
Alex   | 2     | {"property1":"value1"}
Robert | 2     | None

Any ideas on how to solve this?

CodePudding user response:

I reproduced this and faced the same issue with the JSON data that you provided.


Here, in the above JSON, the None value is not inside any quotes and is not a valid JSON literal (unlike a number, a string, or null), which is why Spark treats the whole record as corrupt.
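
To see why this breaks, here is a quick check with Python's json module (a minimal illustration, not part of the original answer): null is a valid JSON literal, but None is not, so the parser rejects Robert's record and Spark routes the whole line to _corrupt_record.

import json

# The record written with the JSON literal null parses fine...
json.loads('{"name": "Robert", "value": 2, "tag": null}')

# ...but the same record written with Python's None is not valid JSON.
try:
    json.loads('{"name": "Robert", "value": 2, "tag": None}')
except json.JSONDecodeError as err:
    print("rejected:", err)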

To get the desired dataframe shown above, provide the schema explicitly, as suggested by @Alex Ott.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

# tag is modelled as a map of strings so that both the nested object and None fit;
# myjson is assumed to be the Python list of dicts shown in the question.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("tag", MapType(StringType(), StringType()), True),
    StructField("value", IntegerType(), True)
])

df = spark.createDataFrame(data=myjson, schema=schema)
df.show()

If we provide the schema explicitly, Spark knows each column's type up front and simply stores a null for the None values instead of flagging the record as corrupt.

OUTPUT:

(df.show() now prints both rows, with null in the tag column for the Robert record and no _corrupt_record column)
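
If you would rather keep the original spark.read.json call from the question, a minimal alternative sketch (not from the answer above, and assuming the records start out as Python dicts) is to serialize them with json.dumps, which writes None as the valid JSON literal null:

import json

# Assumed shape of the source data: a Python list of dicts (hypothetical example).
records = [
    {"name": "Alex", "value": 2, "tag": {"property1": "value1"}},
    {"name": "Robert", "value": 2, "tag": None},
]

# json.dumps turns None into null, so every element is now valid JSON.
array_json = [json.dumps(r) for r in records]

# Reuse the explicit schema defined above (optional once the JSON is valid).
df = spark.read.schema(schema).json(sc.parallelize(array_json))
df.show()

With valid JSON strings, Spark no longer needs a _corrupt_record column, and tag simply comes back as null for Robert.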

If your JSON is in a file, you can try the approach from this SO thread by blackbishop.
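
For example (an illustration only, not necessarily the exact approach from that thread, and with a placeholder path), a multi-line JSON file can be read with the same explicit schema:

df = (
    spark.read
         .option("multiLine", True)
         .schema(schema)                   # schema defined above
         .json("/path/to/records.json")    # placeholder path
)
df.show()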
