How to extract value from json column in spark scala?


I have the following code:

class MyTest extends AnyFlatSpec with Matchers {

   ....

  it should "calculate" in {

    val testDf = Seq(
      TestRow(1, "customer1", "Hi"), // TestRow: the case class backing the test rows
      TestRow(1, "customer2", "Hi")
    ).toDS().toDF()

    val out = MyClass.procOut(spark, testDf)
    out.count() should be(1)
    out.where(col("customer_id") === "customer1").first().getString(out.first().fieldIndex("json_col")) should be(?) // Here I have a problem!

  }

}

My issue: out is a DataFrame, and the expression

out.where(col("customer_id") === "customer1").first().getString(out.first().fieldIndex("json_col"))

extracts the JSON column, which contains:

{
  "statistics": {
    "Group2": {
      "buy": 1
    }
  }
}

My goal is to get the value of buy, e.g. to assert that the value in the JSON is 1.

So I want to do something like:

out.where(col("customer_id") === "customer1").first().getString(out.first().fieldIndex("json_col")["statistics"]["Group2"]["buy"]) should be (1)

Obviously this is not the right syntax, but I can't find the right syntax to extract a specific value from the JSON.

CodePudding user response:

  • json_tuple – extracts data from a JSON string and creates new columns; note that it only reaches top-level fields.

import org.apache.spark.sql.functions.{col, json_tuple}
import spark.implicits._

val jsonString = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
val data = Seq((1, jsonString))
val df = data.toDF("id", "value")

// json_tuple pulls the named top-level fields out as separate string columns
df.select(col("id"), json_tuple(col("value"), "Zipcode", "ZipCodeType", "City"))
  .toDF("id", "Zipcode", "ZipCodeType", "City")
  .show(false)
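With the sample row above, show(false) should print something like:

+---+-------+-----------+-----------+
|id |Zipcode|ZipCodeType|City       |
+---+-------+-----------+-----------+
|1  |704    |STANDARD   |PARC PARQUE|
+---+-------+-----------+-----------+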
  • get_json_object – extracts a JSON element from a JSON string based on the specified JSON path; unlike json_tuple, the path can point at nested fields.

import org.apache.spark.sql.functions.{col, get_json_object}

// "$" is the root of the document; nested fields chain with dots
df.select(col("id"), get_json_object(col("value"), "$.ZipCodeType").as("ZipCodeType"))
  .show(false)
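Applied to the question, a minimal sketch (assuming out has the customer_id and json_col columns described above): get_json_object can pull the nested buy value directly with the path $.statistics.Group2.buy. Note that it always returns the value as a string, or null if the path does not match:

import org.apache.spark.sql.functions.{col, get_json_object}

// Filter to the customer, then extract the nested value as a string column
val buy = out
  .where(col("customer_id") === "customer1")
  .select(get_json_object(col("json_col"), "$.statistics.Group2.buy").as("buy"))
  .first()
  .getString(0)

buy should be("1") // extracted as the string "1"; use buy.toInt should be(1) for a numeric check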