As the title says, in Scala Spark, if I have a DataFrame like below:
+-----+------+
| key | Time |
+-----+------+
|   1 |    1 |
|   1 |    2 |
|   1 |    4 |
|   2 |    2 |
|   2 |    3 |
+-----+------+
For the same key, how can I keep only the row with the least time and drop all the other rows? In this case, key 1 has 3 rows with different times; the least time is 1, so I want to keep only the (key 1, time 1) row and drop the other 2 rows for key 1. Same with key 2: I want to keep only (key 2, time 2), so I drop the (key 2, time 3) row. The type of key is LongType and the type of time is StringType. Is there some way to achieve this?
+-----+------+
| key | Time |
+-----+------+
|   1 |    1 |
|   2 |    2 |
+-----+------+
I tried to use the drop and filter functions, but I don't think they work here.
CodePudding user response:
Try something similar to this, with my own data here of course:
%scala
import spark.implicits._
import org.apache.spark.sql.functions._

// Sample data: duplicate keys in column "i", values in column "c"
val df = sc.parallelize(Seq( (1,7), (1,8), (1,9), (2,2), (2,99) )).toDF("i", "c")

// Group by key and keep the minimum value per key
df.groupBy($"i").agg(min($"c")).show()
returns:
+---+------+
|  i|min(c)|
+---+------+
|  1|     7|
|  2|     2|
+---+------+
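One caveat for the question's actual schema: since Time is StringType, `min` would compare lexicographically, so "10" would sort before "2". A sketch (assuming a local SparkSession and the hypothetical names `MinTimePerKey` and `result`) that casts Time to long before aggregating and casts the minimum back to string to preserve the original schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MinTimePerKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MinTimePerKey")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // key is LongType, Time is StringType, as in the question
    val df = Seq((1L, "1"), (1L, "2"), (1L, "4"), (2L, "2"), (2L, "3"))
      .toDF("key", "Time")

    // Cast Time to long so string comparison doesn't order "10" before "2",
    // then cast the per-key minimum back to string to keep the schema
    val result = df
      .groupBy($"key")
      .agg(min($"Time".cast("long")).cast("string").as("Time"))

    result.orderBy($"key").show()

    spark.stop()
  }
}
```

With the question's data this keeps exactly one row per key: (1, "1") and (2, "2").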