Home > Mobile >  Remove an elemet from an array of struct in spark scala
Remove an elemet from an array of struct in spark scala

Time:07-11

I want to implement an functionality to remove an element from an array of struct in spark scala.For the date "2019-01-26" I want to remove the entire struct from the array column. Following is my code :

  import org.apache.spark.sql.types._
    
    val df=Seq(("123","Jack",Seq(("2020-04-26","200","72","ABC"),("2020-05-26","300","71","ABC"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"))),("124","jones",Seq(("2020-04-26","200","72","ABC"),("2020-05-26","300","71","ABC"),("2020-06-26","200","70","ABC"),("2020-08-26","300","69","ABC"),("2020-08-26","300","69","ABC"))),("125","daniel",Seq(("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF")))).toDF("id","name","history").withColumn("history",$"history".cast("array<struct<infodate:Date,amount1:Integer,amount2:Integer,detail:string>>"))
    
    scala> df.printSchema
    root
     |-- id: string (nullable = true)
     |-- name: string (nullable = true)
     |-- history: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- infodate: date (nullable = true)
     |    |    |-- amount1: integer (nullable = true)
     |    |    |-- amount2: integer (nullable = true)
     |    |    |-- detail: string (nullable = true)

enter image description here

So the for the date 2019-01-26 , I want to remove the struct in which it is present so that it is removed from the array column.I want a solution like this.

enter image description here

I manage to find the solution but it involves lot of hardcoding and I'm searching for a solution/suggestion that is optimal.

Hardcoded solution:

val dfnew=df
 .withColumn( "history" , 
  array_except(
   col("history"),
   array(
    struct(
     lit("2019-01-26").cast(DataTypes.DateType).alias("infodate"),
     lit("200").cast(DataTypes.IntegerType).alias("amount1"), 
     lit("70").cast(DataTypes.IntegerType).alias("amount2"), 
     lit("DEF").alias("detail")
    )
   )
  )
 )

Is there any way of optimally doing it with one filter condition only on date "2019-01-26", which removes the struct/array from the array column.

CodePudding user response:

I use an expression / filter here. Obviosuly it's a string so you can replace the date with a value so that there is even less hard coding. Filters are handy expressions as they let you use SQL notation to reference sub-components of the struct.

scala> :paste
// Entering paste mode (ctrl-D to finish)

df.withColumn( "history" , 
  expr( "filter( history , x -> x.infodate != '2019-01-26' )" )
 ).show(10,false)

// Exiting paste mode, now interpreting.

 --- ------ -------------------------------------------------------------------------------------------------------------------------------------------- 
|id |name  |history                                                                                                                                     |
 --- ------ -------------------------------------------------------------------------------------------------------------------------------------------- 
|123|Jack  |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC]]                                                                                    |
|124|jones |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC], [2020-06-26, 200, 70, ABC], [2020-08-26, 300, 69, ABC], [2020-08-26, 300, 69, ABC]]|
|125|daniel|[]                                                                                                                                          |
 --- ------ -------------------------------------------------------------------------------------------------------------------------------------------- 
  • Related