I have a PySpark dataframe to which I have applied a RandomForestClassifier model (from pyspark.ml.classification import RandomForestClassifier) on multiclass data.
I now have the prediction column and the probability column (a dense vector column). I want a new column containing the single highest probability from the probability column, i.e. the probability corresponding to the prediction. Can you please let me know a way?
+--------------------+----------+--------------+
|         probability|prediction|predictedLabel|
+--------------------+----------+--------------+
|[0.04980166062108...|       9.0|          73.0|
|[0.09709955311030...|       2.0|          92.0|
|[0.00206441341895...|       1.0|          97.0|
|[0.01177280567423...|       8.0|          26.0|
|[0.09170364155771...|       4.0|          78.0|
|[0.09332145486133...|       0.0|          95.0|
|[0.15873541380236...|       0.0|          95.0|
|[0.21929050786626...|       0.0|          95.0|
|[0.08840100103254...|       1.0|          97.0|
|[0.06204585465363...|       1.0|          97.0|
|[0.06961837644280...|       1.0|          97.0|
|[0.04529447218955...|       1.0|          97.0|
|[0.02129073891494...|       2.0|          92.0|
|[0.02692350960234...|       1.0|          97.0|
|[0.02676868258573...|       8.0|          26.0|
|[0.01849528482881...|       1.0|          97.0|
|[0.10405735702064...|       1.0|          97.0|
|[0.01636762299564...|       1.0|          97.0|
|[0.01739759717529...|       1.0|          97.0|
|[0.02129073891494...|       2.0|          92.0|
+--------------------+----------+--------------+
CodePudding user response:
For a plain array column, array_max
(from pyspark.sql.functions) gives the row-wise maximum directly.
Example (adapted from the documentation):
from pyspark.sql.functions import array_max

# toy dataframe with an array<double> column
df = spark.createDataFrame([([0.04, 0.03, 0.01],), ([0.09, 0.05, 0.09],)], ['probability'])
df = df.withColumn("max_prob", array_max(df.probability))
df.show()
Update: since your probability column is a DenseVector rather than an array, convert it with vector_to_array()
(available since Spark 3.0) before applying array_max, like
from pyspark.sql.functions import array_max
from pyspark.ml.functions import vector_to_array

# convert the ML vector to array<double>, then take the row-wise maximum
df = df.withColumn("max_prob", array_max(vector_to_array(df.probability)))
This should give you
+------------------+--------+
|       probability|max_prob|
+------------------+--------+
|[0.04, 0.03, 0.01]|    0.04|
|[0.09, 0.05, 0.09]|    0.09|
+------------------+--------+