Spark scala dataframe groupby


I have the DataFrame below and want to produce the expected one: for each group in col_1 (here 'a'), take the values of col_2 and place them on the same row as distinct columns. Is there a way to do it?

# +-----+-----+-----+
# |col_1|   id|col_2|
# +-----+-----+-----+
# |    a|    1|    c|
# |    a|    2|    f|
# |    a|    3|    i|
# +-----+-----+-----+

Expected

# +-----+-----+-----+-----+
# |col_1|col_c|col_f|col_i|
# +-----+-----+-----+-----+
# |    a|    c|    f|    i|
# +-----+-----+-----+-----+

CodePudding user response:

Assume your DataFrame is called main. We can pivot the col_2 values into columns with the following query:

import org.apache.spark.sql.functions.{col, first}

var created = main.groupBy("col_1").pivot("col_2").agg(first(col("col_2")))

This gives the following output (which is almost what you want):

+-----+---+---+---+
|col_1|  c|  f|  i|
+-----+---+---+---+
|    a|  c|  f|  i|
+-----+---+---+---+

Now we find the columns that do not start with col_ and rename them in the dataset:

val columns = created.columns.filterNot(c => c.startsWith("col_"))

for (i <- columns) {
  created = created.withColumnRenamed(i, "col_" + i)
}
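As a side note, the mutable var and the loop can be avoided with a foldLeft over the columns. A minimal sketch, assuming the same created DataFrame from the pivot step:

// Immutable alternative to the rename loop: fold over the columns
// that lack the "col_" prefix, renaming each one in turn.
val renamed = created.columns
  .filterNot(_.startsWith("col_"))
  .foldLeft(created) { (df, c) =>
    df.withColumnRenamed(c, "col_" + c)
  }

This produces the same schema without reassigning created.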

Final output:

+-----+-----+-----+-----+
|col_1|col_c|col_f|col_i|
+-----+-----+-----+-----+
|    a|    c|    f|    i|
+-----+-----+-----+-----+
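Putting the pieces together, here is an end-to-end sketch. It assumes spark-sql is on the classpath and builds the sample input inline; the names PivotExample and spark are placeholders, not from the original question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first}

object PivotExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("pivot-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample input matching the question's DataFrame.
    val main = Seq(("a", 1, "c"), ("a", 2, "f"), ("a", 3, "i"))
      .toDF("col_1", "id", "col_2")

    // Pivot col_2 values into columns, then prefix the new columns with "col_".
    var created = main.groupBy("col_1").pivot("col_2").agg(first(col("col_2")))
    for (c <- created.columns.filterNot(_.startsWith("col_")))
      created = created.withColumnRenamed(c, "col_" + c)

    created.show()
    spark.stop()
  }
}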