Home > Software design >  Create Array of Unique Values from ColA where Col B are the same in Sql.DataFrame
Create Array of Unique Values from ColA where Col B are the same in Sql.DataFrame

Time:12-21

I have a sql.DataFrame:

ID | Qualifier
-------------
A  |    "test"
A  |    "test2"

And I would like to get to:

ID | QualifierList
-------------
A  |    ["test", "test2"]

Using spark sql functions.

I can't seem to find any way with array_join() etc. Any help would be appreciated!

The datasource I'm using is much larger than this, but when I try array_join() I get the values that exist in the Qualifier column, but not the values where ID are the same.

CodePudding user response:

You can use the groupBy function, followed by the collect_list aggregation function. Something like this:

import org.apache.spark.sql._
import spark.implicits._

val df = Seq(
  ("A", "test"),
  ("A", "test2")
).toDF("ID", "Qualifier")

val output = df.groupBy("ID").agg(collect_list("Qualifier").alias("QualifiersList"))

output.show
 --- --------------                                                                                                                                                                                                                                                             
| ID|QualifiersList|                                                                                                                                                                                                                                                            
 --- --------------                                                                                                                                                                                                                                                             
|  A| [test, test2]|                                                                                                                                                                                                                                                            
 --- -------------- 
  • Related