I want to calculate the row-wise mean, subtract that mean from each value in the row, and get the maximum at the end.
Here is my dataframe:
col1 | col2 | col3
0    | 2    | 3
4    | 2    | 3
1    | 0    | 3
0    | 0    | 0
df = df.withColumn("mean_value", sum(col(x) for x in df.columns[0:3]) / 3)
I can calculate the row-wise mean with that line of code, but I also want to subtract the mean from each value in the row and then take the maximum of those differences.
Required results:
col1 | col2 | col3 | mean_Value | Max_difference_Value
0    | 2    | 3    | 1.66       | 1.34
4    | 2    | 3    | 3.0        | 1.0
1    | 0    | 3    | 1.33       | 1.67
1    | 0    | 1    | 0.66       | 0.66
Note, this is the main formula: abs(mean - column value).max()
CodePudding user response:
Using greatest and a list comprehension:
from pyspark.sql import functions as func

# Sample data from the question
data_ls = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]

spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3']). \
    withColumn('mean_value', (sum(func.col(x) for x in ['col1', 'col2', 'col3']) / 3)). \
    withColumn('max_diff_val',
               func.greatest(*[func.abs(func.col(x) - func.col('mean_value')) for x in ['col1', 'col2', 'col3']])
               ). \
    show()
# +----+----+----+------------------+------------------+
# |col1|col2|col3|        mean_value|      max_diff_val|
# +----+----+----+------------------+------------------+
# |   0|   2|   3|1.6666666666666667|1.6666666666666667|
# |   4|   2|   3|               3.0|               1.0|
# |   1|   0|   3|1.3333333333333333|1.6666666666666667|
# |   0|   0|   0|               0.0|               0.0|
# +----+----+----+------------------+------------------+
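If you would rather not use greatest, a roughly equivalent alternative is to collect the absolute differences into an array column and take its maximum with array_max. This is a minimal sketch assuming Spark 2.4+ (where array_max is available) and an existing SparkSession named spark; the variable names are illustrative.

from pyspark.sql import functions as func

cols = ['col1', 'col2', 'col3']
df = spark.createDataFrame([(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)], cols)

# Row-wise mean over the listed columns
df = df.withColumn('mean_value', sum(func.col(c) for c in cols) / len(cols))
# Build an array of |value - mean| per row, then take the array's maximum
df = df.withColumn('max_diff_val',
                   func.array_max(func.array(*[func.abs(func.col(c) - func.col('mean_value')) for c in cols])))
df.show()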
CodePudding user response:
Have you tried UDFs?
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
import numpy as np

@udf(returnType=FloatType())
def udf_mean(col1, col2, col3):
    # Cast to a plain Python float so Spark can serialize the result
    return float(np.mean([col1, col2, col3]))

df = df.withColumn("mean_value", udf_mean("col1", "col2", "col3"))
You can write a similar UDF for the max difference value; a sketch follows below.
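A minimal sketch of that second UDF, reusing the imports above and assuming the mean_value column has already been added; the name udf_max_diff is just illustrative:

@udf(returnType=FloatType())
def udf_max_diff(col1, col2, col3, mean_value):
    # Maximum absolute difference between the row mean and each column value
    return float(np.max(np.abs(np.array([col1, col2, col3]) - mean_value)))

df = df.withColumn("max_diff_value", udf_max_diff("col1", "col2", "col3", "mean_value"))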