Pyspark avoid _corrupt_record column while loading a list of jsons


I have a dataframe that loads data from a Python list of JSONs.

df = spark.read\
            .option("inferSchema","true")\
            .option("multiline",True)\
            .json(sc.parallelize(array_json))

The JSON structure can look like this:

{
  "name" : "Alex"
  "value" : 2,
  "tag" : {
    "property1" : "value1"
  }
}

But it can also look like this:

{
  "name" : "Robert"
  "value" : 2,
  "tag" : None
}

As you can see, the tag property can either be a JSON object or a null value.

The problem I'm having is that I get a column called _corrupt_record, populated only for the JSON documents that have tag set to None.

_corrupt_record                        | name | value | tag
null                                   | Alex | 2     | {"property1":"value1"}
{"name":"Robert","value":2,"tag":None  | null | null  | null

I want the dataframe to be like the following:

name   | value | tag
Alex   | 2     | {"property1":"value1"}
Robert | 2     | None

Any ideas on how to solve this?

CodePudding user response:

I reproduced this and faced the same issue with the JSON data that you provided.


Here, in the above JSON, the None value is not inside any quotes and is not a valid JSON literal (unlike a number, a string, or null), which is why Spark treats the whole record as corrupt.
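
To see why this breaks, here is a quick check with Python's json module (a minimal illustration, not part of the original answer): null is a valid JSON literal, but None is not, so the parser rejects Robert's record and Spark routes the whole line to _corrupt_record.

import json

# The record written with the JSON literal null parses fine...
json.loads('{"name": "Robert", "value": 2, "tag": null}')

# ...but the same record written with Python's None is not valid JSON.
try:
    json.loads('{"name": "Robert", "value": 2, "tag": None}')
except json.JSONDecodeError as err:
    print("rejected:", err)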

To get the desired dataframe shown above, provide the schema explicitly, as suggested by @Alex Ott.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

# tag is modelled as a map of strings so that both the nested object and None fit;
# myjson is assumed to be the Python list of dicts shown in the question.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("tag", MapType(StringType(), StringType()), True),
    StructField("value", IntegerType(), True)
])

df = spark.createDataFrame(data=myjson, schema=schema)
df.show()

If we provide the schema explicitly, Spark knows each column's type up front and simply stores a null for the None values instead of flagging the record as corrupt.

OUTPUT:

(df.show() now prints both rows, with null in the tag column for the Robert record and no _corrupt_record column)
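
If you would rather keep the original spark.read.json call from the question, a minimal alternative sketch (not from the answer above, and assuming the records start out as Python dicts) is to serialize them with json.dumps, which writes None as the valid JSON literal null:

import json

# Assumed shape of the source data: a Python list of dicts (hypothetical example).
records = [
    {"name": "Alex", "value": 2, "tag": {"property1": "value1"}},
    {"name": "Robert", "value": 2, "tag": None},
]

# json.dumps turns None into null, so every element is now valid JSON.
array_json = [json.dumps(r) for r in records]

# Reuse the explicit schema defined above (optional once the JSON is valid).
df = spark.read.schema(schema).json(sc.parallelize(array_json))
df.show()

With valid JSON strings, Spark no longer needs a _corrupt_record column, and tag simply comes back as null for Robert.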

If your JSON is in a file, you can try the approach from this SO thread by blackbishop.
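
For example (an illustration only, not necessarily the exact approach from that thread, and with a placeholder path), a multi-line JSON file can be read with the same explicit schema:

df = (
    spark.read
         .option("multiLine", True)
         .schema(schema)                   # schema defined above
         .json("/path/to/records.json")    # placeholder path
)
df.show()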
