How to convert JSON file into regular table DataFrame in Apache Spark-CodePudding

I have the following JSON fields

{"constructorId":1,"constructorRef":"mclaren","name":"McLaren","nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}

{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber","nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}

The following code produces the the following DataFrame: I'm running the code on Databricks

df = (spark.read 
      .format(csv) \
      .schema(mySchema) \
      .load(dataPath)    
           )

  display(df)

However, I need the DataFrame to look like the following:

I believe the problem is because the JSON is nested, and I'm trying to convert to CSV. However, I do need to convert to CSV.

Is there code that I can apply to remove the nested feature of the JSON?

CodePudding user response：

Just try:

someDF = spark.read.json(somepath)

Infer schema by default or supply your own, set in your case in pySpark multiLine to false.

someDF = spark.read.json(somepath, someschema, multiLine=False)

See https://spark.apache.org/docs/latest/sql-data-sources-json.html

With schema inference:

df = spark.read.option("multiline","false").json("/FileStore/tables/SOabc2.txt")
df.printSchema()
df.show()
df.count()

returns:

root
 |-- constructorId: long (nullable = true)
 |-- constructorRef: string (nullable = true)
 |-- name: string (nullable = true)
 |-- nationality: string (nullable = true)
 |-- url: string (nullable = true)

 ------------- -------------- ---------- ----------- -------------------- 
|constructorId|constructorRef|      name|nationality|                 url|
 ------------- -------------- ---------- ----------- -------------------- 
|            1|       mclaren|   McLaren|    British|http://en.wikiped...|
|            2|    bmw_sauber|BMW Sauber|     German|http://en.wikiped...|
 ------------- -------------- ---------- ----------- -------------------- 

Out[11]: 2