How to drop multiple columns from JSON body using scala

Time:12-15

I have the JSON structure below in my dataframe as a body attribute. I would like to drop multiple columns/attributes from the content based on a provided list. How can I do this in Scala?

Note that the list of attributes is variable in nature.

For example,

List of columns to drop: List(alias, firstName, lastName)

Input

  "Content":{
     "alias":"Jon",
     "firstName":"Jonathan",
     "lastName":"Mathew",
     "displayName":"Jonathan Mathew",
     "createdDate":"2021-08-10T13:06:35.866Z",
     "updatedDate":"2021-08-10T13:06:35.866Z",
     "isDeleted":false,
     "address":"xx street",
     "phone":"xxx90"
  }

Output

  "Content":{
     "displayName":"Jonathan Mathew",
     "createdDate":"2021-08-10T13:06:35.866Z",
     "updatedDate":"2021-08-10T13:06:35.866Z",
     "isDeleted":false,
     "address":"xx street",
     "phone":"xxx90"
  }

CodePudding user response:

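You can use drop to remove multiple top-level columns at once (attributes nested inside Content would first have to be flattened into top-level columns, for example with select("Content.*")):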

val newDataframe = oldDataframe.drop("alias", "firstName", "lastName")

Documentation :

/**
   * Returns a new Dataset with columns dropped.
   * This is a no-op if schema doesn't contain column name(s).
   *
   * This method can only be used to drop top level columns. the colName string is treated literally
   * without further interpretation.
   *
   * @group untypedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def drop(colNames: String*): DataFrame 
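
Since the list of attributes is variable, it can be passed to drop as varargs with `: _*`. The following is a minimal sketch, assuming it is acceptable to flatten Content into top-level columns first, because drop only sees top-level columns. The local session setup, the shortened JSON sample, and the names `flat` and `attrToDrop` are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("drop-top-level").getOrCreate()
import spark.implicits._

// Shortened sample record (assumption: same shape as the question's input)
val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew"}}"""
val df = spark.read.json(Seq(jsonStr).toDS)

// Variable list of attributes to remove
val attrToDrop = List("alias", "firstName", "lastName")

// Flatten the struct so its fields become top-level columns,
// then pass the list to drop as varargs.
val flat = df.select("id", "Content.*").drop(attrToDrop: _*)
```

Note that this changes the shape of the data: the remaining attributes end up as top-level columns rather than inside Content.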

CodePudding user response:

You can get the list of attributes from the dataframe schema, then update the Content column by building a struct from all of its attributes except those in your list of columns to drop.

Here's a complete working example:

// spark-shell imports these automatically; a standalone application needs them
// (spark is your SparkSession; toDS comes from spark.implicits._)
import org.apache.spark.sql.functions.{col, struct}
import spark.implicits._

val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","createdDate":"2021-08-10T13:06:35.866Z","updatedDate":"2021-08-10T13:06:35.866Z","isDeleted":false,"address":"xx street","phone":"xxx90"}}"""

val df = spark.read.json(Seq(jsonStr).toDS)

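// Variable list of attributes to remove from Content
val attrToDrop = Seq("alias", "firstName", "lastName")

// Names of all attributes currently inside the Content struct
val contentAttrList = df.select("Content.*").columns

// Rebuild Content as a struct of every attribute that is not in attrToDrop
val df2 = df.withColumn(
  "Content",
  struct(
    contentAttrList
      .filter(!attrToDrop.contains(_))
      .map(c => col(s"Content.$c")): _*
  )
)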

df2.printSchema
//root
// |-- Content: struct (nullable = false)
// |    |-- address: string (nullable = true)
// |    |-- createdDate: string (nullable = true)
// |    |-- displayName: string (nullable = true)
// |    |-- isDeleted: boolean (nullable = true)
// |    |-- phone: string (nullable = true)
// |    |-- updatedDate: string (nullable = true)
// |-- id: long (nullable = true)
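
On Spark 3.1 or later, the same result can likely be reached without rebuilding the struct by hand, using `Column.dropFields`. This is a sketch under that version assumption, not part of the original answer; the local session setup, the shortened JSON sample, and the name `df3` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("drop-fields").getOrCreate()
import spark.implicits._

// Shortened sample record (assumption: same shape as the question's input)
val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","isDeleted":false}}"""
val df = spark.read.json(Seq(jsonStr).toDS)

// Variable list of attributes to remove from Content
val attrToDrop = Seq("alias", "firstName", "lastName")

// dropFields (Spark 3.1+) removes the named fields from a struct column;
// Content stays a struct and the other columns are untouched.
val df3 = df.withColumn("Content", col("Content").dropFields(attrToDrop: _*))
```

Unlike flattening with select("Content.*"), this keeps the remaining attributes nested under Content, which matches the output requested in the question.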
