How to read an array of structs from a CSV file in Spark/Scala


I am trying to read a CSV file where one of the columns contains an array of JSON structs. When the CSV is read into a dataframe, that column comes in as a string of the form [{...}, {...}, ...]. What's the best way to turn this into an array of structs? I know the schema of the structs. My ultimate goal is to then explode the array, convert the structs to JSON, and expand them using *.

Also, I am aware of spark.read.json, but I only need to apply this to one column of my dataframe and am not sure how to do this. Is there a way to do this without specifying an explicit schema?

CodePudding user response:

Let's assume you have already read the data into a dataframe df with two columns, name and jsonData, where jsonData holds three fields: address, city, and state.

First, create a schema for the JSON as shown below:

import org.apache.spark.sql.types._

// Schema of one struct: address, city, and state are all strings
val schema = StructType(Seq(
  StructField("address", StringType, nullable = true),
  StructField("city", StringType, nullable = true),
  StructField("state", StringType, nullable = true)
))

Then extract the data you want using from_json:

import org.apache.spark.sql.functions._

// Parse the JSON string into a struct, then flatten its fields into columns
val dfexplode = df.select(col("name"), from_json(col("jsonData"), schema).as("j"))
  .select("name", "j.*")

dfexplode.printSchema()
dfexplode.show()
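
Note that since the column in the question actually holds an array of structs, you would wrap the struct schema in an ArrayType and explode the parsed column before flattening. A minimal sketch, reusing df, schema, and the column names above (arraySchema, dfArray, arr, and j are just illustrative names):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// The string looks like [{...}, {...}, ...], so the full schema is an array of structs
val arraySchema = ArrayType(schema)

// Parse the string into an array, emit one row per struct, then expand with *
val dfArray = df
  .select(col("name"), from_json(col("jsonData"), arraySchema).as("arr"))
  .select(col("name"), explode(col("arr")).as("j"))
  .select("name", "j.*")

dfArray.printSchema()
dfArray.show()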

CodePudding user response:

Using from_json as Andy_101 suggested, you can pick an arbitrary row to infer the schema from, using the solution here: How to query JSON data column using Spark DataFrames?

However, the solution of doing val json_schema = spark.read.json(df.select("json_strings").as[String]).schema and then df.withColumn("json_parsed", from_json(col("json_strings"), json_schema)) returns a DF where the "json_parsed" column is all null values. Is this because the JSON is an array?
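
One plausible explanation: spark.read.json treats a top-level JSON array as one record per element, so the inferred json_schema describes a single struct, and from_json returns null when handed the full array-shaped string. A minimal sketch of a workaround, assuming the json_strings column from the snippet above, is to wrap the inferred schema in an ArrayType before parsing:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.ArrayType
import spark.implicits._

// read.json flattens a top-level array into rows, so this is the element schema
val json_schema = spark.read.json(df.select("json_strings").as[String]).schema

// Wrap it in an ArrayType so from_json can parse the whole [{...}, {...}] string
val parsed = df.withColumn(
  "json_parsed",
  from_json(col("json_strings"), ArrayType(json_schema))
)

// Explode the array and expand each struct into top-level columns
parsed
  .select(explode(col("json_parsed")).as("j"))
  .select("j.*")
  .show()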
