I have a sql.DataFrame:
ID | Qualifier
--------------
A  | "test"
A  | "test2"
And I would like to get to:
ID | QualifierList
------------------
A  | ["test", "test2"]
using Spark SQL functions.
I can't seem to find any way to do this with array_join() etc. Any help would be appreciated!
The data source I'm using is much larger than this, but when I try array_join() I get the values that exist in the Qualifier column, not the values collected together for rows with the same ID.
CodePudding user response:
You can use groupBy followed by the collect_list aggregation function. Something like this:
// collect_list lives in org.apache.spark.sql.functions
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val df = Seq(
  ("A", "test"),
  ("A", "test2")
).toDF("ID", "Qualifier")

val output = df.groupBy("ID")
  .agg(collect_list("Qualifier").alias("QualifiersList"))

output.show()
+---+--------------+
| ID|QualifiersList|
+---+--------------+
|  A| [test, test2]|
+---+--------------+
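Since you mentioned Spark SQL functions: the same aggregation can also be written as a SQL query against a temporary view (a sketch assuming the df above; the view name qualifiers is made up here). And array_join() is still useful if you want a single delimited string instead of an array, applied on top of collect_list:

```scala
// Hypothetical view name, for illustration only.
df.createOrReplaceTempView("qualifiers")

// Equivalent SQL form of the groupBy/collect_list aggregation.
val sqlOutput = spark.sql(
  """SELECT ID,
    |       collect_list(Qualifier)                  AS QualifiersList,
    |       array_join(collect_list(Qualifier), ", ") AS QualifiersJoined
    |FROM qualifiers
    |GROUP BY ID""".stripMargin)

sqlOutput.show()
```

Here QualifiersJoined would come back as a single string like "test, test2", which is where array_join() fits: it flattens the collected array, but only after the grouping has happened.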