df.show
+-------+----+----------+--------+------------+----+
|     id| val|      date|    time|         use|flag|
+-------+----+----------+--------+------------+----+
|8200732|   1|2015-01-06|11:48:30|30065.221532|   0|
|8200733|   1|2015-01-06|11:48:40|30065.225763|   0|
|8200734|   1|2015-01-06|11:48:50|30065.229994|   0|
|8200735|   1|2015-01-06|11:49:00|30065.234225|   0|
+-------+----+----------+--------+------------+----+
I am trying to get the average use for each date value. Here is what I tried:
df.select("date",max($"use")).show()
<console>:26: error: overloaded method value select with alternatives:
[U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (String, org.apache.spark.sql.Column)
I am not sure what I am doing wrong; I have rewritten this many times, but each time I get an error. I can get the max value of use on its own, but getting the max value of use for each date is causing me issues.
I cannot use Spark SQL or PySpark for this.
CodePudding user response:
That's because your call doesn't match any of the overloads of the select
method on DataFrame. The one that you wrote is:
df.select("date",max($"use")).show()
Here "date"
is a String literal, while max($"use")
is a Column
, and no select overload accepts a mixed (String, Column) argument list: you must pass either all Strings or all Columns. Use the date column instead of the literal date string:
// notice the $ before date here
df.select($"date",max($"use")).show()
CodePudding user response:
Here is what you should do to get the average use for each date value:
df.groupBy("date").agg(mean("use")).show()