df.show
+-------+----+----------+--------+------------+----+
|     id| val|      date|    time|         use|flag|
+-------+----+----------+--------+------------+----+
|8200732|   1|2015-01-06|11:48:30|30065.221532|   0|
|8200733|   1|2015-01-06|11:48:40|30065.225763|   0|
|8200734|   1|2015-01-06|11:48:50|30065.229994|   0|
|8200735|   1|2015-01-06|11:49:00|30065.234225|   0|
+-------+----+----------+--------+------------+----+
I am trying to get the average use for each date value. Here is what I tried:
df.select("date",max($"use")).show()
<console>:26: error: overloaded method value select with alternatives:
[U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (String, org.apache.spark.sql.Column)
I am not sure what I am doing wrong; I have rewritten this many times, but each time I get an error. I can get the max value of use on its own, but getting the max value of use for each date is causing me issues.
I cannot use Spark SQL or PySpark for this.
CodePudding user response:
That's because your call doesn't match any of the overloads of the select
method on DataFrame. The one that you wrote is:
df.select("date",max($"use")).show()
Here "date"
is a String literal, while max($"use")
is a Column
, and no select overload accepts a mixed (String, Column) argument list: you must pass either all Strings or all Columns. Use the date column instead of the literal date string:
// notice the $ before date here
df.select($"date",max($"use")).show()
CodePudding user response:
Here is what you should do to get the average use for each date value:
df.groupBy("date").agg(mean("use")).show()