How do I find the id of the row with the maximum value of a column, together with that column's average value, using a PySpark dataframe?
df:
+-------------+-------+----------+---------+----------+--------+------+------------------+
|ChargingEvent|   CPID| StartDate|StartTime|   EndDate| EndTime|Energy|    PluginDuration|
+-------------+-------+----------+---------+----------+--------+------+------------------+
|     16673806|AN11719|2017-12-31| 14:46:00|2017-12-31|18:00:00|   2.4|3.2333333333333334|
|     16670986|AN01706|2017-12-31| 11:25:00|2017-12-31|13:14:00|   6.1|1.8166666666666667|
|      3174961|AN18584|2017-12-31| 11:26:11|2018-01-01|12:54:11|    24|25.466666666666665|
+-------------+-------+----------+---------+----------+--------+------+------------------+
Current code:
df.agg({'PluginDuration': 'max'}).show()
df.agg({'PluginDuration': 'avg'}).show()
Then rename all of the columns to match the expected outcome below:
+----+---------+---------+
|  id|max_value|avg_value|
+----+---------+---------+
|QWER|    96.26|    12.35|
+----+---------+---------+
id is renamed from CPID, and max_value & avg_value need to be rounded to 2 decimal places.
CodePudding user response:
I have adapted an SQL-style approach to the dataframe methods, which works and answers the question.
from pyspark.sql import functions as F

# pull the max and average of PluginDuration back to the driver
mx = df.agg({'PluginDuration': 'max'}).collect()[0][0]
av = df.agg({'PluginDuration': 'avg'}).collect()[0][0]

# add the rounded max/avg as literal columns, sort descending so the
# row holding the maximum PluginDuration comes first, then rename and
# keep only that top row
df\
    .withColumn('max_value', F.lit(round(mx, 2)))\
    .withColumn('avg_value', F.lit(round(av, 2)))\
    .sort('PluginDuration', ascending=False)\
    .selectExpr('CPID as id', 'max_value', 'avg_value')\
    .limit(1)\
    .show()
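For reference, a variant of the same idea that avoids the two collect() round trips: compute both aggregates in a single agg() pass (rounding inside Spark with F.round rather than Python's round), take the id of the top row, and cross-join the two one-row frames. A minimal sketch, not the answer's original code:

from pyspark.sql import functions as F

# both aggregates in one pass, rounded to 2 decimal places inside Spark
aggs = df.agg(
    F.round(F.max('PluginDuration'), 2).alias('max_value'),
    F.round(F.avg('PluginDuration'), 2).alias('avg_value'),
)

# CPID of the row with the largest PluginDuration
top = df.orderBy(F.desc('PluginDuration')) \
        .select(F.col('CPID').alias('id')) \
        .limit(1)

# both frames hold exactly one row, so a cross join pairs them up
top.crossJoin(aggs).show()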