Remove empty lists in Scala from ArrayType column in Spark Dataframe

Time:10-23

I have a column of type ArrayType in a DataFrame. Some of its lists contain null/empty values, and some lists are entirely empty. E.g.:

+-------+
|xyz    |
+-------+
|[a,,]  |
|[]     |
+-------+

I want to remove the null/empty values from each list and drop the rows whose lists end up empty, so the column looks like:

+-------+
|xyz    |
+-------+
|[a]    |
+-------+

How can I achieve this while joining this column? Thank you.

CodePudding user response:

You can map over the array column, filter out the null and empty entries, and then drop the rows whose arrays end up empty.

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, size}
import scala.collection.mutable.WrappedArray

val data = Seq(Array("a",null,""), Array(""))
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF("xyz")
df.show() 

+------+
|   xyz|
+------+
|[a,, ]|
|    []|
+------+


// Filter out the null and empty strings from each array,
// then drop the rows whose arrays are left empty
val mappedDF = df
  .map { case Row(xs: WrappedArray[String]) =>
    xs.filter(s => s != null && s.nonEmpty)
  }
  .toDF("xyz")
  .filter(size(col("xyz")) > 0)
mappedDF.show()

+---+
|xyz|
+---+
|[a]|
+---+

CodePudding user response:

Try:

import org.apache.spark.sql.functions.{size, col, expr}

// Spark 2.4+: the filter higher-order function drops the null and empty
// entries inside each array; the outer filter then removes rows whose
// arrays are left empty
df.withColumn("xyz", expr("filter(xyz, x -> x is not null and x != '')"))
  .filter(size(col("xyz")) > 0)
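On Spark 3.0+, the same higher-order function is also exposed directly in the Scala API as `org.apache.spark.sql.functions.filter`, so the lambda can be written as a `Column` expression instead of a SQL string. A minimal, self-contained sketch (the local `SparkSession` setup and the column name `xyz` follow the question; the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, filter, size}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("clean-arrays")
  .getOrCreate()
import spark.implicits._

val df = Seq(Seq("a", null, ""), Seq("")).toDF("xyz")

// Drop null/empty entries inside each array,
// then drop rows left with an empty array
val cleaned = df
  .withColumn("xyz", filter(col("xyz"), x => x.isNotNull && x =!= ""))
  .filter(size(col("xyz")) > 0)

cleaned.show()
// +---+
// |xyz|
// +---+
// |[a]|
// +---+
```

Unlike the `map`-based answer, this stays entirely in the untyped DataFrame API, so Catalyst can optimize the whole expression and no encoder or `Row` pattern matching is needed.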