I want to read a JSON file in PySpark, but the file is in this format (one JSON object per line, with no commas or square brackets):
{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}
Is there an easy way to read this JSON in PySpark?
I have already tried this code:
df = spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")
But it doesn't work: only the first line appears in the parquet file.
I just want to read this JSON file and save it as parquet...
CodePudding user response:
Try reading it as a text file first, then parse each line into a JSON object:
import json
from pyspark.sql import Row

# Read each line of the file as plain text
lines = spark.read.text("data.json")
# Parse each line into a Row (calling toDF on plain dicts is deprecated)
parsed_lines = lines.rdd.map(lambda row: Row(**json.loads(row[0])))
# Convert the parsed rows to a DataFrame
df = parsed_lines.toDF()
df.write.parquet("data.parquet")
CodePudding user response:
Only the first line appears because the multiline option is set to "true": Spark then treats the whole file as a single JSON document, but in your file each line is a separate JSON object. If you set multiline to "false" (the default), it will work as expected:
df = spark.read.option("multiline", "false").json("data.json")
df.show()
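To see why only one row survives, here is a plain-Python sketch of the two parsing modes, using the standard json module as a stand-in for Spark's parser (the sample data is the file content from the question):

```python
import json

# The file content: one JSON object per line (NDJSON), no enclosing array
data = '{"id": 1, "name": "jhon"}\n{"id": 2, "name": "bryan"}\n{"id": 3, "name": "jane"}'

# multiline=false behaves like this: each line is its own JSON document
records = [json.loads(line) for line in data.splitlines()]
print(len(records))  # 3

# multiline=true behaves like this: the whole file is one JSON document,
# so parsing stops after the first complete object
first, _ = json.JSONDecoder().raw_decode(data)
print(first)  # {'id': 1, 'name': 'jhon'}
```

This is only an illustration of the parsing behavior, not what Spark runs internally.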
If your JSON file had instead contained a JSON array, like
[
{"id": 1, "name": "jhon"},
{"id": 2, "name": "bryan"},
{"id": 3, "name": "jane"}
]
or
[
{
"id": 1,
"name": "jhon"
},
{
"id": 2,
"name": "bryan"
}
]
then setting the multiline option to "true" would work.
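Again using the json module as a plain-Python stand-in: a JSON array is a single document containing all the records, which is why the whole-file (multiline) mode is the right choice for that layout.

```python
import json

# The same three records, wrapped in a JSON array
array_text = """[
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"},
    {"id": 3, "name": "jane"}
]"""

# One parse of the whole document yields every record
records = json.loads(array_text)
print(len(records))  # 3
```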