Home > Software design >  Sorting values of a dataframe list via lookup on another table
Sorting values of a dataframe list via lookup on another table

Time:09-28

I have two dataframes The first is a key with a list of ids. The second dataframe is a table containing values for each of the ids. I'd like to sort the list of ids in the first table with values from the second DF.

For example here is DF 1:

1, list(rabbit,dog,cat,giraffe)
2, list(kangaroo,rhino,bear)
3, list(fish,dolphin,bird,whale)

and here is DF 2:

cat,1
dog,2
rabbit, 3
giraffe, 4
rhino, 5
kangaro, 6, 
bear, 7
bird, 8
fish, 9
dolphin, 10
whale, 11

after sorting I'd like the output to be like this:

1, list(cat,dog,rabbit,giraffe)
2, list(rhino,kangaroo,bear)
3, list(bird,fish,dolphin,whale)

how can I achieve this desired sort in spark/scala?

I have tried .sort() but that just orders the values alphabetically

CodePudding user response:

UPDATE: You can explode the list, join it to the sorting DF to have the order column, and then use windowing to collect the list back in order:

Assuming DF1 is (id: Int, list: Seq[String]) and DF2 is (name: String, ord: Int):

val w = Window.partitionBy('id).orderBy('ord)
df1.select('id, explode('list).as("name"))
   .join(df2, "name")
   .withColumn("list", collect_list('name).over(w))
   .groupBy('id).agg(max('list))

 --- ---------------------------- 
|id |max(list)                   |
 --- ---------------------------- 
|1  |[cat, dog, rabbit, giraffe] |
|3  |[bird, fish, dolphin, whale]|
|2  |[rhino, kangaroo, bear]     |
 --- ---------------------------- 

Verified it on the small dataset you provided. Might be fairly slow if collected list is large

  • Related