I have a column in a DataFrame which is of type ArrayType. At the moment this column contains arrays with null entries, and some arrays that are empty. E.g.:
+------+
|xyz   |
+------+
|[a,,] |
|[]    |
+------+
I want to remove the null values from the arrays and output the column as:
+----+
|xyz |
+----+
|[a] |
+----+
How can I achieve this while joining on this column? Thank you.
CodePudding user response:
You can use map and filter to achieve this.
import spark.implicits._
import org.apache.spark.sql.Row
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, size}

// Sample data: one array with a null and an empty string, and one array with only an empty string
val data = Seq(Array("a", null, ""), Array(""))
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF("xyz")

df.show()
+------+
|   xyz|
+------+
|[a,, ]|
|    []|
+------+
// Use map to filter out all the null or empty strings, then remove rows that are left with empty arrays
val mappedDF = df
  .map { case Row(x: WrappedArray[String]) =>
    x.filter(_ != null)
     .filter(_.nonEmpty)
  }
  .toDF("xyz")
  .filter(size(col("xyz")) > 0)
mappedDF.show()
+---+
|xyz|
+---+
|[a]|
+---+
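
If you are on Spark 3.0 or later, a column-based filter higher-order function is also available, which avoids the typed map (and its encoder) entirely. A minimal sketch, assuming the same df with an array column xyz as above:

import org.apache.spark.sql.functions.{col, filter, size}

// Keep entries that are neither null nor empty, then drop rows whose array becomes empty
val cleaned = df
  .withColumn("xyz", filter(col("xyz"), x => x.isNotNull && x =!= ""))
  .filter(size(col("xyz")) > 0)

cleaned.show()

Staying in the untyped Column API lets Spark optimise the whole expression without deserialising each row into a Scala collection.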
CodePudding user response:
Try:
import org.apache.spark.sql.functions.{size, col, expr}

// Drop the null entries with the SQL higher-order "filter" function (Spark 2.4+),
// then remove any rows whose array ends up empty
df.withColumn("xyz", expr("filter(xyz, x -> x is not null)"))
  .filter(size(col("xyz")) > 0)
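
The sample rows in the question also contain empty strings; if those should be dropped as well, the same expression accepts a compound predicate. A minimal sketch, assuming the same df as above:

df.withColumn("xyz", expr("filter(xyz, x -> x is not null and x != '')"))
  .filter(size(col("xyz")) > 0)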