How to use agg function while not drop other columns in dataframe?-CodePudding

Suppose I have a dataframe like below:

 ------------- ---------- ----- ------ 
| key         | Time     |value|value2|
 ------------- ---------- ----- ------ 
|  1          |        1 |  1  |   1  |
|  1          |        2 |  2  |   2  |
|  1          |        4 |  3  |   3  |
|  2          |        2 |  4  |   4  |
|  2          |        3 |  5  |   5  | 
 ------------- ---------- ----- ------

I want to select the key with same value with the least time. For this case, the is the dataframe I want.

 ---------- ---------- ----- ------ 
| Time     | key      |value|value2|
 ---------- ---------- ----- ------ 
|  1       |        1 |  1  |   1  |
|  2       |        2 |  4  |   4  |
 ---------- ---------- ----- ------

I tried use groupBy and agg, but these operations will drop the value columns. Is there a way to keep the value1 and value2 columns?

CodePudding user response：

You can use struct to create a tuple containing the time column and everything you want to keep (the s column in the code below). Then, you can use s.* to unfold the struct and retrieve your columns.

val result = df
    .withColumn("s", struct('Time, 'value, 'value2))
    .groupBy("key")
    .agg(min('s) as "s")
    .select("key", "s.*")

// to order the columns the way you want:
val new_result = result.select("Time", "key", "value", "value2")
// or
val new_result = result.select("Time", df.columns.filter(_ != "time") : _*)