My dataframe has a body attribute containing the JSON structure below.
I would like to drop multiple columns/attributes from Content based on a provided list. How can I do this in Scala?
Note that the list of attributes is variable.
For example:
List of columns to drop: List(alias, firstName, lastName)
Input:
"Content":{
  "alias":"Jon",
  "firstName":"Jonathan",
  "lastName":"Mathew",
  "displayName":"Jonathan Mathew",
  "createdDate":"2021-08-10T13:06:35.866Z",
  "updatedDate":"2021-08-10T13:06:35.866Z",
  "isDeleted":false,
  "address":"xx street",
  "phone":"xxx90"
}
Output:
"Content":{
  "displayName":"Jonathan Mathew",
  "createdDate":"2021-08-10T13:06:35.866Z",
  "updatedDate":"2021-08-10T13:06:35.866Z",
  "isDeleted":false,
  "address":"xx street",
  "phone":"xxx90"
}
CodePudding user response:
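You can use drop to drop multiple top-level columns at once (as the documentation below states, it does not reach fields nested inside a struct such as Content):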
val newDataframe = oldDataframe.drop("alias", "firstName", "lastName")
Documentation:
/**
 * Returns a new Dataset with columns dropped.
 * This is a no-op if schema doesn't contain column name(s).
 *
 * This method can only be used to drop top level columns. the colName string is treated literally
 * without further interpretation.
 *
 * @group untypedrel
 * @since 2.0.0
 */
@scala.annotation.varargs
def drop(colNames: String*): DataFrame
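Since the list of columns in the question is variable, it can be passed to drop through Scala's varargs expansion. The following is a minimal sketch, assuming a local SparkSession and a hypothetical flat DataFrame named oldDataframe; the session setup and sample data are illustrative and not taken from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("drop-sketch").getOrCreate()
import spark.implicits._

// Hypothetical flat DataFrame whose top-level columns mirror some Content attributes.
val oldDataframe = Seq(("Jon", "Jonathan", "Mathew", "xxx90"))
  .toDF("alias", "firstName", "lastName", "phone")

// The list can be built at runtime; `: _*` expands it into drop's varargs.
val colsToDrop = List("alias", "firstName", "lastName")
val newDataframe = oldDataframe.drop(colsToDrop: _*)
```

Because drop only handles top-level columns, this form applies once the attributes are themselves columns; fields still nested inside Content would be left untouched.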
CodePudding user response:
You can get the list of attributes from the dataframe schema, then update the Content column
by creating a struct with all attributes except those in your list of columns to drop.
Here's a complete working example:
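// Assumes an existing SparkSession named `spark` (as in spark-shell).
import org.apache.spark.sql.functions.{col, struct}
import spark.implicits._

val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","createdDate":"2021-08-10T13:06:35.866Z","updatedDate":"2021-08-10T13:06:35.866Z","isDeleted":false,"address":"xx street","phone":"xxx90"}}"""
val df = spark.read.json(Seq(jsonStr).toDS)

// Attributes to remove; this list can be built at runtime.
val attrToDrop = Seq("alias", "firstName", "lastName")
// All attribute names currently present in the Content struct.
val contentAttrList = df.select("Content.*").columns

// Rebuild Content from every attribute that is not in attrToDrop.
val df2 = df.withColumn(
  "Content",
  struct(
    contentAttrList
      .filter(!attrToDrop.contains(_))
      .map(c => col(s"Content.$c")): _*
  )
)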
df2.printSchema
//root
// |-- Content: struct (nullable = false)
// |    |-- address: string (nullable = true)
// |    |-- createdDate: string (nullable = true)
// |    |-- displayName: string (nullable = true)
// |    |-- isDeleted: boolean (nullable = true)
// |    |-- phone: string (nullable = true)
// |    |-- updatedDate: string (nullable = true)
// |-- id: long (nullable = true)
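On Spark 3.1 or later, Column.dropFields may be a shorter alternative to rebuilding the struct by hand. The sketch below works under that version assumption, with a shortened copy of the sample JSON and a local SparkSession set up only for illustration; df3 is a hypothetical name.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("dropFields-sketch").getOrCreate()
import spark.implicits._

// Shortened version of the sample document from the question.
val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","phone":"xxx90"}}"""
val df = spark.read.json(Seq(jsonStr).toDS)

// Attributes to remove; this list can be built at runtime.
val attrToDrop = Seq("alias", "firstName", "lastName")

// dropFields (Spark 3.1+) removes the named fields from the struct column.
val df3 = df.withColumn("Content", col("Content").dropFields(attrToDrop: _*))
```

Unlike the struct-rebuilding approach, this does not require listing the attributes of Content first, but it is unavailable on Spark versions before 3.1.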