Home > Blockchain >  How to translate a complex nested JSON structure into multiple columns in a Spark DataFrame
How to translate a complex nested JSON structure into multiple columns in a Spark DataFrame

Time:01-31

I am learning Scala, and am trying to filter a select few columns from a large nested json file to make into a DataFrame. This is the gist of the json:

{
  “meta”: 
    {“a”: 1, b: 2”}    // I want to ignore meta
  “objects”:
  [
    {
         “caucus”: “Progressive”,
     “person” : 
         {
          “name”: “Mary”,
          “party”: “Green Party”,
          “age”: 50,
          “gender”: “female” // etc..
         }
    }, // etc.
   ] 
}

Hence the data looks like this as is, read in with spark.

    val df = spark.read.json("file")
    df.show()
 -------------------- -------------------- 
|                meta|             objects|
 -------------------- -------------------- 
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
 -------------------- -------------------- 

Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.

I've messed around with explode() and have reproduced the schema as a StructType(), but am not sure how to deal with a nested structure like this in general.

CodePudding user response:

There's no generic way to handle it because of course it depends on the shape of your data. In your case, you want to explode an array, which will create a column called col, that will contain structs. You can then access the fields within the struct using the dot notation, so to extract the fields you asked for you can do this:

df.select(explode_outer($"objects")).
  select(
     $"col.caucus", 
     $"col.person.name",
     $"col.person.party").show

 ----------- ---- ----------- 
|     caucus|name|      party|
 ----------- ---- ----------- 
|Progressive|Mary|Green Party|
 ----------- ---- ----------- 

CodePudding user response:

You can use ".*" on a column of type struct to tranform it it into multiple fields columns:

val df = spark.read.json("file.json")
df.select(col("meta"), explode(col("objects")).as("objects"))
  .select("meta.*", "objects.*")
  .select("a", "b", "caucus", "person.*")
  .show(false)


 --- --- ----------- --- ------ ---- ----------- 
|a  |b  |caucus     |age|gender|name|party      |
 --- --- ----------- --- ------ ---- ----------- 
|1  |2  |Progressive|50 |female|Mary|Green Party|
 --- --- ----------- --- ------ ---- ----------- 
  • Related