I am learning Scala and am trying to filter a select few columns out of a large nested JSON file to build a DataFrame. This is the gist of the JSON:
{
  "meta":
    {"a": 1, "b": 2}, // I want to ignore meta
  "objects":
  [
    {
      "caucus": "Progressive",
      "person":
      {
        "name": "Mary",
        "party": "Green Party",
        "age": 50,
        "gender": "female" // etc.
      }
    } // etc.
  ]
}
Read in with Spark as is, the data looks like this:
val df = spark.read.json("file")
df.show()
+--------------------+--------------------+
|                meta|             objects|
+--------------------+--------------------+
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
+--------------------+--------------------+
Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.
I've messed around with explode() and have reproduced the schema as a StructType(), but am not sure how to deal with a nested structure like this in general.
CodePudding user response:
There's no fully generic way to handle it, because it depends on the shape of your data. In your case, you want to explode an array, which creates a column (named col by default) containing the structs.
You can then access the fields within the struct using the dot notation, so to extract the fields you asked for you can do this:
import org.apache.spark.sql.functions.explode_outer
import spark.implicits._

df.select(explode_outer($"objects"))
  .select(
    $"col.caucus",
    $"col.person.name",
    $"col.person.party")
  .show()
+-----------+----+-----------+
|     caucus|name|      party|
+-----------+----+-----------+
|Progressive|Mary|Green Party|
+-----------+----+-----------+
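If you would rather not rely on the default col name, you can alias the exploded column explicitly. A small variation on the snippet above (obj is just an illustrative alias):

```scala
import org.apache.spark.sql.functions.explode_outer
import spark.implicits._

// Alias the exploded struct so the later select reads naturally
val flat = df
  .select(explode_outer($"objects").as("obj"))
  .select($"obj.caucus", $"obj.person.name", $"obj.person.party")

flat.show()
```

Note that explode_outer keeps rows where the array is null or empty (emitting nulls), whereas plain explode would drop them.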
CodePudding user response:
You can use ".*" on a column of type struct to transform it into one column per field:
import org.apache.spark.sql.functions.{col, explode}

val df = spark.read.json("file.json")
df.select(col("meta"), explode(col("objects")).as("objects"))
  .select("meta.*", "objects.*")
  .select("a", "b", "caucus", "person.*")
  .show(false)
+---+---+-----------+---+------+----+-----------+
|a  |b  |caucus     |age|gender|name|party      |
+---+---+-----------+---+------+----+-----------+
|1  |2  |Progressive|50 |female|Mary|Green Party|
+---+---+-----------+---+------+----+-----------+
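When the nesting is deeper or unfamiliar, printSchema is the quickest way to find the dotted paths to pass to select. A sketch, assuming the same df as above (the printed schema is abbreviated here):

```scala
// Inspect the inferred schema to discover field paths like objects.person.name
df.printSchema()
// root
//  |-- meta: struct (...)
//  |-- objects: array
//  |    |-- element: struct
//  |    |    |-- caucus: string
//  |    |    |-- person: struct
//  |    |    |    |-- name: string
//  |    |    |    |-- party: string
//  |    |    |    |-- ...
```

Any path you see in that tree can be used directly in a select once the array level has been exploded.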