Pyspark reading json file with indentation character (\t)


I am trying to read a JSON file using PySpark. I can usually open JSON files without trouble, but when reading this particular file the indentation shows up as \t characters. At first, I made the following attempt to read the file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark_learning").getOrCreate()
read1 = spark.read.format("json").option("multiplelines", "true").load(file_path)

This resulted in a single ['_corrupt_record'] column as the outcome. In a second attempt, I tried the following code:

read2 = spark.read.format("text").load(file_path)
read2.show() 

The output is

+--------------------+
|               value|
+--------------------+
|                   {|
| \t"key1": "value1",|
| \t"key2": "value2",|
| \t"key3": "value3",|
|        \t"key4": [{|
|\t\t"sub_key1": 1...|
|  \t\t"sub_key2": [{|
|\t\t\t"nested_key...|
|                \t}]|
+--------------------+


When I compared this JSON file to others I was able to read, I noticed the difference is the \t: the file's indentation is being read as literal \t characters. I also tried to replace \t with blank spaces using the answers already available (e.g., How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?). However, I was not successful; it still gave me a _corrupt_record column. I would be happy to receive any help from this community.
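Roughly, the replacement attempt looked like this (a sketch, not my exact code; it assumes a regexp_replace on the text column from the read2 approach above):

from pyspark.sql import functions as F

# Read the file line by line as text, then strip the tabs.
raw = spark.read.format("text").load(file_path)
cleaned = raw.select(F.regexp_replace("value", "\t", " ").alias("value"))
cleaned.show(truncate=False)
# Each row is still only one line of the JSON document, so parsing the rows
# individually keeps producing _corrupt_record.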

(PS: I am new to the Big Data world and PySpark.)

Here is the sample data:

{
    "key1": "value1",
    "key2": "value2",
    "key3": "value3",
    "key4": [{
        "sub_key1": 1111,
        "sub_key2": [{
            "nested_key1": [5555]
        }]
    }]
}

(https://drive.google.com/file/d/1_0-9d41LnFR8_OGP4k0HK7ghn1JkQhmR/view?usp=sharing)

CodePudding user response:

The .option() should be multiline, not multiplelines. With this change you should be able to read the JSON as-is into a dataframe; otherwise you have to read the file with wholeTextFiles() and map it to JSON (see the sketch after the output below).

df = spark.read.format("json").option("multiline", "true").load(file_path)
df.show(truncate=False)

+------+------+------+--------------------+
|key1  |key2  |key3  |key4                |
+------+------+------+--------------------+
|value1|value2|value3|[{1111, [{[5555]}]}]|
+------+------+------+--------------------+
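For completeness, the wholeTextFiles() fallback mentioned above could look roughly like this (a sketch, assuming the file is small enough to be read whole and file_path as in the question):

# Each element of wholeTextFiles() is a (path, content) pair holding the
# entire file, so the tabs and newlines no longer split records apart.
rdd = spark.sparkContext.wholeTextFiles(file_path)
json_rdd = rdd.map(lambda pair: pair[1])   # keep only the raw JSON text
df = spark.read.json(json_rdd)             # parse the whole document at once
df.show(truncate=False)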