Pyspark: Get the maximum prediction value for each row in a new column from a dense vector column


I have a PySpark dataframe to which I have applied a Random Forest Classifier model (from pyspark.ml.classification import RandomForestClassifier) on multiclass data.

Now I have the prediction and probability columns (the probability column is a dense vector). I want to extract, into a new column, the single highest probability from the probability column, i.e. the one corresponding to the prediction. Can you please let me know a way?

+--------------------+----------+--------------+
|         probability|prediction|predictedLabel|
+--------------------+----------+--------------+
|[0.04980166062108...|       9.0|          73.0|
|[0.09709955311030...|       2.0|          92.0|
|[0.00206441341895...|       1.0|          97.0|
|[0.01177280567423...|       8.0|          26.0|
|[0.09170364155771...|       4.0|          78.0|
|[0.09332145486133...|       0.0|          95.0|
|[0.15873541380236...|       0.0|          95.0|
|[0.21929050786626...|       0.0|          95.0|
|[0.08840100103254...|       1.0|          97.0|
|[0.06204585465363...|       1.0|          97.0|
|[0.06961837644280...|       1.0|          97.0|
|[0.04529447218955...|       1.0|          97.0|
|[0.02129073891494...|       2.0|          92.0|
|[0.02692350960234...|       1.0|          97.0|
|[0.02676868258573...|       8.0|          26.0|
|[0.01849528482881...|       1.0|          97.0|
|[0.10405735702064...|       1.0|          97.0|
|[0.01636762299564...|       1.0|          97.0|
|[0.01739759717529...|       1.0|          97.0|
|[0.02129073891494...|       2.0|          92.0|
+--------------------+----------+--------------+

CodePudding user response:

If the probability column were a plain array column, array_max (from pyspark.sql.functions.array_max) would be the natural fit.

Example (adapted from the documentation):

from pyspark.sql.functions import array_max
df = spark.createDataFrame([([0.04, 0.03, 0.01],), ([0.09, 0.05, 0.09],)], ['probability'])
df = df.withColumn("max_prob", array_max(df.probability))
df.show()

Update: since your probability column is a DenseVector rather than an array, convert it with vector_to_array() before applying array_max, like

from pyspark.sql.functions import array_max
from pyspark.ml.functions import vector_to_array
df = df.withColumn("max_prob", array_max(vector_to_array(df.probability)))

This should give you

+------------------+--------+
|       probability|max_prob|
+------------------+--------+
|[0.04, 0.03, 0.01]|    0.04|
|[0.09, 0.05, 0.09]|    0.09|
+------------------+--------+