Read Json in Pyspark


I want to read a JSON file in PySpark, but the file is in this format (no commas between objects and no square brackets):

{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}

Is there an easy way to read this JSON in PySpark?

I have already tried this code:

df = spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")

But it doesn't work: only the first record ends up in the Parquet file.

I just want to read this JSON file and save it as Parquet...

CodePudding user response:

Try reading the file as plain text first, then parse each line into a JSON object:

import json

from pyspark.sql import Row

# Read each line of the file as a plain string
lines = spark.read.text("data.json")

# Parse each line and wrap it in a Row so toDF() can infer the schema
parsed_lines = lines.rdd.map(lambda row: Row(**json.loads(row[0])))

# Convert the parsed rows into a DataFrame and save as Parquet
df = parsed_lines.toDF()
df.write.parquet("data.parquet")
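
A quick sanity check before (or instead of) writing, assuming the three sample records from the question, is to display the parsed DataFrame:

df.show()
# +---+-----+
# | id| name|
# +---+-----+
# |  1| jhon|
# |  2|bryan|
# |  3| jane|
# +---+-----+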

CodePudding user response:

Only the first line appears when reading your file because the multiline option is set to true, while in your case each line is a separate JSON object. If you set the multiline option to false instead, it works as expected:

df = spark.read.option("multiline", "false").json("data.json")
df.show()
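
Since the goal in the question is to end up with a Parquet file, the parsed DataFrame can then be written out; a minimal follow-up, reusing the output path from the question:

# Persist the parsed DataFrame as Parquet (same output path as in the question)
df.write.parquet("data.parquet")

Note that multiline defaults to false in Spark's JSON reader, so spark.read.json("data.json") without the option would also read this line-delimited file correctly.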

If your JSON file had instead contained a JSON array, like

[
{"id": 1, "name": "jhon"},
{"id": 2, "name": "bryan"},
{"id": 3, "name": "jane"}
]

or

[
    {
        "id": 1, 
        "name": "jhon"
    },
    {
        "id": 2, 
        "name": "bryan"
    }
]

then setting the multiline option to true would work.
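
For that array layout, a minimal sketch, assuming data.json now holds one of the arrays shown above:

# Read a file containing a single JSON array that spans multiple lines
df_array = spark.read.option("multiline", "true").json("data.json")
df_array.show()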
