I am trying to read a CSV file where one of the columns consists of an array of JSON structs. When the CSV file is read into a dataframe, the column comes out as type string, and the string looks like [{...}, {...}, ...]. What's the best way to turn this into an array of structs? I know the schema of the structs. My ultimate goal is to then explode the array, convert the structs to JSON, and then expand them using *.
I am also aware of spark.read.json, but I only need to apply this to one column of my dataframe and am not sure how to do that. Is there a way to do this without specifying an explicit schema?
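For concreteness, a minimal reproduction of the situation (column names and values are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The JSON column comes out of the CSV read as a plain string
val df = Seq(
  ("alice", """[{"address":"1 Main St","city":"Springfield","state":"IL"}]"""),
  ("bob", """[{"address":"2 Oak Ave","city":"Portland","state":"OR"},{"address":"3 Elm Rd","city":"Salem","state":"OR"}]""")
).toDF("name", "jsonData")

df.printSchema() // jsonData is a string column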
CodePudding user response:
Let's assume you have already read the data into a dataframe df with two columns, name and jsonData, where jsonData holds three fields: address, city, and state. First, create a schema for the JSON as shown below:
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("address", StringType, true),
  StructField("city", StringType, true),
  StructField("state", StringType, true)
))
Then extract the data you want using from_json:

import org.apache.spark.sql.functions.{col, from_json}

val dfexplode = df
  .select(col("name"), from_json(col("jsonData"), schema).as("j"))
  .select("name", "j.*")
dfexplode.printSchema()
dfexplode.show()
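Since the question's column actually holds an array of structs rather than a single struct, here is a sketch of a variant (reusing the name/jsonData columns and the schema above, which are assumptions of this answer) that wraps the schema in ArrayType and explodes the result:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.ArrayType

// Parse the string as an array of structs, explode to one row per
// struct, then expand the struct's fields with "j.*"
val dfArray = df
  .select(col("name"), from_json(col("jsonData"), ArrayType(schema)).as("arr"))
  .select(col("name"), explode(col("arr")).as("j"))
  .select("name", "j.*")

dfArray.printSchema()
dfArray.show()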
CodePudding user response:
Using from_json as Andy_101 suggested, you can choose an arbitrary row to infer the schema from, using the solution here: How to query JSON data column using Spark DataFrames?
However, doing

val json_schema = spark.read.json(df.select("json_strings").as[String]).schema

and then

df.withColumn("json_parsed", from_json(col("json_strings"), json_schema))

returns a DF where the "json_parsed" column is all null values. Is this because the JSON is an array?
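For what it's worth, a minimal sketch of that inference pattern (the column name json_strings is taken from the snippet above): spark.read.json treats a top-level JSON array as one record per element, so the inferred schema describes a single struct, and from_json likely needs it wrapped in ArrayType for the parse to succeed rather than return null:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.ArrayType
import spark.implicits._

// Infer the schema of one array element from the string column itself;
// spark.read.json flattens a top-level JSON array into one row per element.
val json_schema = spark.read.json(df.select("json_strings").as[String]).schema

// The column holds an array, so wrap the inferred element schema in
// ArrayType; a bare StructType schema yields nulls here.
val parsed = df.withColumn(
  "json_parsed",
  from_json(col("json_strings"), ArrayType(json_schema))
)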