How to arrange dataframe according to the given json format-CodePudding

This is the dataframe which I'm working on I want to change it to a dataset with the follow format

"Id":{ "Unique-id" : "id" "Details" : { "Person-name" : "name" "Person-name" : "age" } }

CodePudding user response：

There are couple of issue with your json.

You can't have same key "Person-name" twice within the sub json "Details"
You are missing the outer curly braces.

So the corrected json object should look like this.

{"Id": {"Unique-id": "id", "Details": {"Person-name": "name", "Person-age": "25"}}}

If so, then the transformation is pretty straight forward.

df = (spark.read.json('test.json')
      .withColumn('Unique-Id', F.col('Id.Unique-id'))
      .withColumn('age', F.col('Id.Details.Person-age'))
      .withColumn('name', F.col('Id.Details.Person-name'))
      .drop('Id')
      .withColumnRenamed('Unique-Id', 'id')
  )

CodePudding user response：

You just need to select the fields that you need. Assuming the following input DataFrame:

case class Top(Id: Inner)
case class Inner(`Unique-id`: String, Details: Details)
case class Details(`Person-name`: String, `Person-age`: Int)

val ss: SparkSession = ???
import ss.implicits._

val df = Seq(
  Top(Inner("1201", Details("satish", 25)))
).toDF

You will have the following:

root
 |-- Id: struct (nullable = true)
 |    |-- Unique-id: string (nullable = true)
 |    |-- Details: struct (nullable = true)
 |    |    |-- Person-name: string (nullable = true)
 |    |    |-- Person-age: integer (nullable = false)

You can construct your desired output with:


df.select(
  col("Id.Unique-id") as "id",
  col("Id.Details.Person-name") as "name",
  col("Id.Details.Person-age") as "age",
)