Pyspark: Write CSV from JSON file with struct column-CodePudding

I'm reading a .json file that contains the structure below, and I need to generate a csv with this data in column form, I know that I can't directly write an array-type object in a csv, I used the explode function to remove the fields I need , being able to leave them in a columnar form, but when writing the data frame in csv, I'm getting an error when using the explode function, from what I understand it's not possible to do this with two variables in the same select, can someone help me with something alternative?

from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[1]")
    .appName("sample")
    .getOrCreate())

df = (spark.read.option("multiline", "true")
    .json("data/origin/crops.json"))

df2 = (explode('history').alias('history'), explode('trial').alias('trial'))
.select('history.started_at', 'history.finished_at', col('id'), trial.is_trial, trial.ws10_max))

(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))

root
 |-- history: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- finished_at: string (nullable = true)
 |    |    |-- started_at: string (nullable = true)
 |-- id: long (nullable = true)
 |-- trial: struct (nullable = true)
 |    |-- is_trial: boolean (nullable = true)
 |    |-- ws10_max: double (nullable = true)

I'm trying to return something like this

started_at	finished_at	is_trial	ws10_max
First	row	row
Second	row	row

Thank you!

CodePudding user response：

Use explode on array and select("struct.*") on struct.

df.select("trial", "id", explode('history').alias('history')),
  .select('id', 'history.*', 'trial.*'))