Create Array of Unique Values from ColA where Col B are the same in Sql.DataFrame-CodePudding

I have a sql.DataFrame:

ID | Qualifier
-------------
A  |    "test"
A  |    "test2"

And I would like to get to:

ID | QualifierList
-------------
A  |    ["test", "test2"]

Using spark sql functions.

I can't seem to find any way with array_join() etc. Any help would be appreciated!

The datasource I'm using is much larger than this, but when I try array_join() I get the values that exist in the Qualifier column, but not the values where ID are the same.

CodePudding user response：

You can use the groupBy function, followed by the collect_list aggregation function. Something like this:

import org.apache.spark.sql._
import spark.implicits._

val df = Seq(
  ("A", "test"),
  ("A", "test2")
).toDF("ID", "Qualifier")

val output = df.groupBy("ID").agg(collect_list("Qualifier").alias("QualifiersList"))

output.show
 --- --------------                                                                                                                                                                                                                                                             
| ID|QualifiersList|                                                                                                                                                                                                                                                            
 --- --------------                                                                                                                                                                                                                                                             
|  A| [test, test2]|                                                                                                                                                                                                                                                            
 --- --------------