How to keep one specific row while dropping all other unnecessary rows in scala spark dataframe?


As the title says: in Scala Spark, suppose I have a dataframe like the one below:

+-----+------+
| key | Time |
+-----+------+
|   1 |    1 |
|   1 |    2 |
|   1 |    4 |
|   2 |    2 |
|   2 |    3 |
+-----+------+

For each key, how can I keep only the row with the least time and drop all the other rows? In this case, key 1 has 3 rows with different times; the least time is 1, so I want to keep the (key 1, time 1) row and drop the other 2 rows for key 1. Likewise for key 2, I want to keep (key 2, time 2) and drop the (key 2, time 3) row. The key is LongType and the time is StringType. Is there a way to achieve this?

+-----+------+
| key | Time |
+-----+------+
|   1 |    1 |
|   2 |    2 |
+-----+------+

I tried the drop and filter functions, but I don't think they work here.

CodePudding user response:

Try something similar to this (using my own data here, of course):

%scala

import spark.implicits._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq( (1,7), (1,8), (1,9), (2,2), (2,99) )).toDF("i", "c")
df.groupBy($"i").agg(min($"c")).show()

returns:

+---+------+
|  i|min(c)|
+---+------+
|  1|     7|
|  2|     2|
+---+------+
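If you need to keep the entire original row rather than just the aggregated minimum, a window function is one alternative. The sketch below assumes the question's dataframe with `key` and `Time` columns; since `Time` is StringType, it is cast to long so the ordering is numeric rather than lexicographic (otherwise "10" would sort before "2"):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows within each key by numeric time, smallest first
val w = Window.partitionBy($"key").orderBy($"Time".cast("long"))

// Keep only the first-ranked row per key, then drop the helper column
val result = df
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")

result.show()
```

This preserves any other columns the dataframe may have, whereas `groupBy(...).agg(min(...))` returns only the grouped and aggregated columns.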