Extract columns from a json and write into a dataframe

Time:02-05

I have the following string:

{
    "code": 4,
    "results": [
        {
            "requests": 100,
            "requests_country": 291,
            "listing": {
                "first": 1,
                "second": 2
            }
        },
        {
            "requests": 200,
            "requests_country": 292,
            "listing": {
                "first": 10,
                "second": 220
            }
        }
    ]
}

I would like to extract certain values in order to create a dataframe. This is the desired output:

+--------+----------------+
|requests|requests_country|
+--------+----------------+
|     100|             291|
|     200|             292|
+--------+----------------+

I have tried a lot of methods, but none of them work.

I tried converting the string to a map and then parsing out the results field, but I keep getting errors.

CodePudding user response:

By default Spark expects JSON input to be line-delimited, so you could either collapse the string to a single line or read it with the multiLine option. Then read it as a DataFrame, select the results column, serialize it back to a JSON string, and read that string again as JSON.

Something like this:

import org.apache.spark.sql.functions.{col, to_json}
import spark.implicits._

// multiLine lets Spark parse the pretty-printed (multi-line) JSON file
val df = spark.read.option("multiLine", value = true).json("path/to/your/jsonfile.json")

// results is an array of structs, so serialize it back to a JSON string
// before handing it to spark.read.json again
val jsonDS = df.select(to_json(col("results"))).as[String]
val resultsDF = spark.read.json(jsonDS)

That should return a DataFrame with requests, requests_country, and listing fields.
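Since the question starts from an in-memory string rather than a file, the temp file can be skipped entirely by wrapping the string in a Dataset[String] and passing it to spark.read.json directly. A minimal self-contained sketch (object and app names are placeholders, and the question's JSON is inlined as a single line):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object FromStringExample extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("json-from-string")
    .getOrCreate()
  import spark.implicits._

  // the question's JSON, collapsed to a single string in memory
  val json =
    """{"code":4,"results":[{"requests":100,"requests_country":291,"listing":{"first":1,"second":2}},{"requests":200,"requests_country":292,"listing":{"first":10,"second":220}}]}"""

  // spark.read.json also accepts a Dataset[String], one JSON document per element
  val df = spark.read.json(Seq(json).toDS())

  // explode the results array into rows, then pick the two wanted fields
  val result = df
    .select(explode(col("results")).as("r"))
    .select(col("r.requests"), col("r.requests_country"))

  result.show()
  spark.stop()
}
```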

CodePudding user response:

Make use of explode_outer

import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.SparkSession

object Main extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("Spark app")
    .getOrCreate()

  // multiLine is needed because the question's JSON is pretty-printed across lines
  val df = spark.read.option("multiLine", value = true).json("src/main/resources/app2_2.json")
  df.show()
  //+----+--------------------+
  //|code|             results|
  //+----+--------------------+
  //|   4|[{{1, 2}, 100, 29...|
  //+----+--------------------+

  val df1 = df.select(explode_outer(col("results")))
  df1.show()
  //+--------------------+
  //|                 col|
  //+--------------------+
  //|  {{1, 2}, 100, 291}|
  //|{{10, 220}, 200, ...|
  //+--------------------+

  val df2 = df1.select(col("col.requests"), col("col.requests_country"))
  df2.show()
  //+--------+----------------+
  //|requests|requests_country|
  //+--------+----------------+
  //|     100|             291|
  //|     200|             292|
  //+--------+----------------+
}
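The question's JSON also carries a nested listing struct that the selects above leave out; if those values are needed as well, the same dot-path selection reaches one level deeper. A sketch under the same explode_outer approach (object name is a placeholder, and the data is inlined so it runs standalone):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode_outer}

object ListingExample extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("Spark app")
    .getOrCreate()
  import spark.implicits._

  // same shape as the question's JSON, inlined for a self-contained run
  val json =
    """{"code":4,"results":[{"requests":100,"requests_country":291,"listing":{"first":1,"second":2}},{"requests":200,"requests_country":292,"listing":{"first":10,"second":220}}]}"""
  val df = spark.read.json(Seq(json).toDS())

  // explode the array, then pull both top-level and nested fields by dot path
  val flat = df
    .select(explode_outer(col("results")).as("r"))
    .select(
      col("r.requests"),
      col("r.requests_country"),
      col("r.listing.first").as("listing_first"),
      col("r.listing.second").as("listing_second")
    )
  flat.show()
  spark.stop()
}
```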
