I am trying to read a JSON file using PySpark. I am usually able to open JSON files, but somehow this particular one, when read, shows its indentation as \t characters. At first, I made the following attempt to read the file:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark_learning").getOrCreate()
read1 = spark.read.format("json").option("multiplelines", "true").load(file_path)
This resulted in a DataFrame with only a ['_corrupt_record'] column. In my second attempt, I tried the following code:
read2 = spark.read.format("text").load(file_path)
read2.show()
The output is
+--------------------+
|               value|
+--------------------+
|                   {|
| \t"key1": "value1",|
| \t"key2": "value2",|
| \t"key3": "value3",|
|        \t"key4": [{|
|\t\t"sub_key1": 1...|
|  \t\t"sub_key2": [{|
|\t\t\t"nested_key...|
|                \t}]|
|                   }|
+--------------------+
When I compared this JSON file to the others I was able to read, I noticed the \t difference: the file's indentation is being read as literal tab characters (\t). I also tried to replace \t with a blank space using the answers already available (e.g., How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?), but I was not successful; it still gave me a _corrupt_record column (a rough sketch of that attempt is included after the sample data below). I would be happy to receive any help from this community.
(PS: I am new to the Big Data world and PySpark.)
Here is the sample data:
{
	"key1": "value1",
	"key2": "value2",
	"key3": "value3",
	"key4": [{
		"sub_key1": 1111,
		"sub_key2": [{
			"nested_key1": [5555]}]
	}]
}
(https://drive.google.com/file/d/1_0-9d41LnFR8_OGP4k0HK7ghn1JkQhmR/view?usp=sharing)
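For reference, my \t-replacement attempt looked roughly like this (a sketch of the kind of thing I tried, not my exact code):

raw = spark.read.format("text").load(file_path)

from pyspark.sql import functions as F

# Read the raw file line by line, strip the tab characters,
# and then try to parse the cleaned lines as JSON again.
cleaned = raw.withColumn("value", F.regexp_replace("value", "\t", " "))
reparsed = spark.read.json(cleaned.rdd.map(lambda row: row.value))
reparsed.show()  # still ends up as a single _corrupt_record column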
CodePudding user response:
The .option() key should be multiline, not multiplelines. With this change you should be able to read the JSON as is into a DataFrame; otherwise you have to read the file with wholeTextFiles() and map it to JSON (a sketch of that fallback follows the output below).
df = spark.read.format("json").option("multiline", "true").load(file_path)
df.show(truncate=False)
+------+------+------+--------------------+
|key1  |key2  |key3  |key4                |
+------+------+------+--------------------+
|value1|value2|value3|[{1111, [{[5555]}]}]|
+------+------+------+--------------------+
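For completeness, a minimal sketch of the wholeTextFiles() fallback mentioned above could look like this (assuming file_path points to the single JSON file, with the schema still inferred by Spark):

# wholeTextFiles() returns (path, full_file_content) pairs, so the
# multi-line JSON document stays in one piece instead of being split per line.
whole = spark.sparkContext.wholeTextFiles(file_path)
df2 = spark.read.json(whole.map(lambda pair: pair[1]))
df2.show(truncate=False)

The multiline option is still the simpler fix; this route is mainly useful if you want to preprocess the raw text before parsing.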