I want to implement an functionality to remove an element from an array of struct in spark scala.For the date "2019-01-26" I want to remove the entire struct from the array column. Following is my code :
import org.apache.spark.sql.types._
val df=Seq(("123","Jack",Seq(("2020-04-26","200","72","ABC"),("2020-05-26","300","71","ABC"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"))),("124","jones",Seq(("2020-04-26","200","72","ABC"),("2020-05-26","300","71","ABC"),("2020-06-26","200","70","ABC"),("2020-08-26","300","69","ABC"),("2020-08-26","300","69","ABC"))),("125","daniel",Seq(("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF")))).toDF("id","name","history").withColumn("history",$"history".cast("array<struct<infodate:Date,amount1:Integer,amount2:Integer,detail:string>>"))
scala> df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- infodate: date (nullable = true)
| | |-- amount1: integer (nullable = true)
| | |-- amount2: integer (nullable = true)
| | |-- detail: string (nullable = true)
So the for the date 2019-01-26 , I want to remove the struct in which it is present so that it is removed from the array column.I want a solution like this.
I manage to find the solution but it involves lot of hardcoding and I'm searching for a solution/suggestion that is optimal.
Hardcoded solution:
val dfnew=df
.withColumn( "history" ,
array_except(
col("history"),
array(
struct(
lit("2019-01-26").cast(DataTypes.DateType).alias("infodate"),
lit("200").cast(DataTypes.IntegerType).alias("amount1"),
lit("70").cast(DataTypes.IntegerType).alias("amount2"),
lit("DEF").alias("detail")
)
)
)
)
Is there any way of optimally doing it with one filter condition only on date "2019-01-26", which removes the struct/array from the array column.
CodePudding user response:
I use an expression / filter here. Obviosuly it's a string so you can replace the date with a value so that there is even less hard coding. Filters are handy expressions as they let you use SQL notation to reference sub-components of the struct.
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.withColumn( "history" ,
expr( "filter( history , x -> x.infodate != '2019-01-26' )" )
).show(10,false)
// Exiting paste mode, now interpreting.
--- ------ --------------------------------------------------------------------------------------------------------------------------------------------
|id |name |history |
--- ------ --------------------------------------------------------------------------------------------------------------------------------------------
|123|Jack |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC]] |
|124|jones |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC], [2020-06-26, 200, 70, ABC], [2020-08-26, 300, 69, ABC], [2020-08-26, 300, 69, ABC]]|
|125|daniel|[] |
--- ------ --------------------------------------------------------------------------------------------------------------------------------------------