I have the following string:
{
"code": 4,
"results": [
{
"requests": 100,
"requests_country": 291,
"listing": {
"first": 1,
"second": 2
}
},
{
"requests": 200,
"requests_country": 292,
"listing": {
"first": 10,
"second": 220
}
}
]
}
I would like to extract certain values in order to create a dataframe. This is the desired output:
+---------+----------------+
| requests|requests_country|
+---------+----------------+
|      100|             291|
|      200|             292|
+---------+----------------+
I tried a lot of methods but none of them work. I tried converting the string to a map and then using parse to extract results, but I keep getting errors.
CodePudding user response:
You could change the string to be a single line, since Spark expects JSON files to be line-delimited by default. Then you can read it as a DataFrame, select the results column, and read it again as JSON.
Something like this:
import org.apache.spark.sql.functions.{col, to_json}
import spark.implicits._

val df = spark.read.json("path/to/your/jsonfile.json")
// results is parsed as an array of structs, so serialize it back to a JSON string first
val jsonDF = df.select(to_json(col("results"))).as[String]
val resultsDF = spark.read.json(jsonDF)
That should return a DataFrame with requests, requests_country, and listing fields.
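An alternative to putting the whole file on one line is Spark's multiLine read option. A minimal sketch, assuming the same placeholder path as above (rawDF / resultsDF are just illustrative names), which also narrows the result down to the two columns from the desired output:
import org.apache.spark.sql.functions.{col, to_json}
import spark.implicits._

// multiLine lets Spark parse a pretty-printed (multi-line) JSON file directly
val rawDF = spark.read.option("multiLine", "true").json("path/to/your/jsonfile.json")

// results is an array of structs; serialize it back to JSON so that
// spark.read.json produces one row per array element
val resultsDF = spark.read.json(rawDF.select(to_json(col("results"))).as[String])

// keep only the two columns from the desired output
resultsDF.select(col("requests"), col("requests_country")).show()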
CodePudding user response:
Make use of explode_outer to turn each element of the results array into its own row:
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.SparkSession
object Main extends App {
val spark = SparkSession.builder
.master("local")
.appName("Spark app")
.getOrCreate()
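// note: spark.read.json expects line-delimited JSON by default; if app2_2.json
// is pretty-printed over multiple lines, add .option("multiLine", true)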
val df = spark.read.json("src/main/resources/app2_2.json")
df.show()
//+----+--------------------+
//|code|             results|
//+----+--------------------+
//|   4|[{{1, 2}, 100, 29...|
//+----+--------------------+
val df1 = df.select(explode_outer(col("results")))
df1.show()
//+--------------------+
//|                 col|
//+--------------------+
//|  {{1, 2}, 100, 291}|
//|{{10, 220}, 200, ...|
//+--------------------+
val df2 = df1.select(col("col.requests"), col("col.requests_country"))
df2.show()
//+--------+----------------+
//|requests|requests_country|
//+--------+----------------+
//|     100|             291|
//|     200|             292|
//+--------+----------------+
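// if the nested listing values are needed as well, the exploded struct can be
// flattened with dot paths; this is just a sketch, and the listing_first /
// listing_second aliases are illustrative names
val df3 = df1.select(
  col("col.requests"),
  col("col.requests_country"),
  col("col.listing.first").as("listing_first"),
  col("col.listing.second").as("listing_second")
)
df3.show()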
}