Suppose I have a dataframe like below:
------------- ---------- ----- ------
| key | Time |value|value2|
------------- ---------- ----- ------
| 1 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 |
| 1 | 4 | 3 | 3 |
| 2 | 2 | 4 | 4 |
| 2 | 3 | 5 | 5 |
------------- ---------- ----- ------
I want to select the key with same value with the least time. For this case, the is the dataframe I want.
---------- ---------- ----- ------
| Time | key |value|value2|
---------- ---------- ----- ------
| 1 | 1 | 1 | 1 |
| 2 | 2 | 4 | 4 |
---------- ---------- ----- ------
I tried use groupBy
and agg
, but these operations will drop the value
columns. Is there a way to keep the value1
and value2
columns?
CodePudding user response:
You can use struct
to create a tuple containing the time
column and everything you want to keep (the s
column in the code below). Then, you can use s.*
to unfold the struct and retrieve your columns.
val result = df
.withColumn("s", struct('Time, 'value, 'value2))
.groupBy("key")
.agg(min('s) as "s")
.select("key", "s.*")
// to order the columns the way you want:
val new_result = result.select("Time", "key", "value", "value2")
// or
val new_result = result.select("Time", df.columns.filter(_ != "time") : _*)