How to subtract the row-wise mean from each column value and get the row-wise max after subtracting the mean

Time:08-22

I want to calculate the row-wise mean, subtract it from each value in the row, and get the maximum at the end.

Here is my dataframe:


 col1 | col2 | col3
  0   |  2   |  3
  4   |  2   |  3
  1   |  0   |  3
  0   |  0   |  0

from pyspark.sql.functions import col
df = df.withColumn("mean_value", sum(col(x) for x in df.columns[0:3]) / 3)  # built-in sum over all three columns

I can calculate the row-wise mean with this line of code, but I want to subtract the mean from each value in the row and get the maximum value of the row after the subtraction.

Required results:

 col1 | col2 | col3 | mean_Value | Max_difference_Value
  0   |  2   |  3   |    1.66    |        1.34
  4   |  2   |  3   |    3.0     |        1.0
  1   |  0   |  3   |    1.33    |        1.67
  1   |  0   |  1   |    0.66    |        0.66

Note: this is the main formula: abs(mean - column value).max()
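As a quick check of that formula outside Spark, here is a minimal plain-Python sketch. The helper name max_abs_deviation and the rows list are assumptions made for illustration; the rows are copied from the input dataframe in the question.

```python
def max_abs_deviation(row):
    """Return (mean, max |value - mean|) for one row of numbers."""
    mean = sum(row) / len(row)
    return mean, max(abs(v - mean) for v in row)

# rows copied from the question's input dataframe
rows = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]
results = [max_abs_deviation(r) for r in rows]

for row, (mean, diff) in zip(rows, results):
    print(row, round(mean, 2), round(diff, 2))
```

Applied literally, the formula gives about 1.67 for the row (0, 2, 3), because 0 is the value farthest from the mean.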

CodePudding user response:

Use greatest with a list comprehension:

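from pyspark.sql import functions as func

# sample rows from the question; assumes an active SparkSession named spark
data_ls = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]

spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3']). \
    withColumn('mean_value', (sum(func.col(x) for x in ['col1', 'col2', 'col3']) / 3)). \
    withColumn('max_diff_val', 
               func.greatest(*[func.abs(func.col(x) - func.col('mean_value')) for x in ['col1', 'col2', 'col3']])
               ). \
    show()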

# +----+----+----+------------------+------------------+
# |col1|col2|col3|        mean_value|      max_diff_val|
# +----+----+----+------------------+------------------+
# |   0|   2|   3|1.6666666666666667|1.6666666666666667|
# |   4|   2|   3|               3.0|               1.0|
# |   1|   0|   3|1.3333333333333333|1.6666666666666667|
# |   0|   0|   0|               0.0|               0.0|
# +----+----+----+------------------+------------------+

CodePudding user response:

Have you tried UDFs?

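from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
import numpy as np

# declare the return type; without it the UDF defaults to returning strings
@udf(returnType=FloatType())
def udf_mean(col1, col2, col3):
    # cast to a plain Python float; Spark UDFs expect built-in types, not numpy.float64
    return float(np.mean([col1, col2, col3]))

# pass the columns by name; the bare names col1, col2, col3 are not defined variables
df = df.withColumn("mean_value", udf_mean("col1", "col2", "col3"))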

Similarly, you can write a UDF for the max difference value.
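A minimal sketch of the per-row logic such a UDF could use, written as a plain Python function so it can be checked without a Spark session. The name max_abs_diff and the registration lines in the comments are assumptions, not part of the original answer, and the function does not handle null values.

```python
def max_abs_diff(col1, col2, col3):
    """Largest absolute deviation of the three values from their mean."""
    values = [col1, col2, col3]
    mean = sum(values) / len(values)
    # return a plain Python float so Spark could map it to a DoubleType
    return float(max(abs(v - mean) for v in values))

# Hypothetical registration as a UDF (assumes an existing DataFrame df):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import DoubleType
# udf_max_diff = udf(max_abs_diff, DoubleType())
# df = df.withColumn("max_diff_val", udf_max_diff("col1", "col2", "col3"))
```

Python UDFs are usually slower than built-in column functions such as greatest, so this route is mainly worth it when the logic cannot be expressed with built-ins.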
