I want to calculate the row-wise mean, subtract that mean from each value in the row, and get the maximum at the end.
Here is my dataframe:
col1 | col2 | col3
0    | 2    | 3
4    | 2    | 3
1    | 0    | 3
0    | 0    | 0
df = df.withColumn("mean_value", sum(col(x) for x in df.columns[0:3]) / 3)
I can calculate the row-wise mean with that line of code, but I also want to subtract the mean from each value in the row and then take the maximum of those differences.
Required results:
col1 | col2 | col3 | mean_Value | Max_difference_Value
0    | 2    | 3    | 1.66       | 1.34
4    | 2    | 3    | 3.0        | 1.0
1    | 0    | 3    | 1.33       | 1.67
1    | 0    | 1    | 0.66       | 0.66
Note, this is the main formula: abs(mean - column value).max()
CodePudding user response:
Using greatest and a list comprehension:
from pyspark.sql import functions as func

# Sample data from the question
data_ls = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]

spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3']). \
    withColumn('mean_value', (sum(func.col(x) for x in ['col1', 'col2', 'col3']) / 3)). \
    withColumn('max_diff_val',
               func.greatest(*[func.abs(func.col(x) - func.col('mean_value')) for x in ['col1', 'col2', 'col3']])
               ). \
    show()
# +----+----+----+------------------+------------------+
# |col1|col2|col3|        mean_value|      max_diff_val|
# +----+----+----+------------------+------------------+
# |   0|   2|   3|1.6666666666666667|1.6666666666666667|
# |   4|   2|   3|               3.0|               1.0|
# |   1|   0|   3|1.3333333333333333|1.6666666666666667|
# |   0|   0|   0|               0.0|               0.0|
# +----+----+----+------------------+------------------+
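If you would rather not use greatest, a roughly equivalent alternative is to collect the absolute differences into an array column and take its maximum with array_max. This is a minimal sketch assuming Spark 2.4+ (where array_max is available) and an existing SparkSession named spark; the variable names are illustrative.

from pyspark.sql import functions as func

cols = ['col1', 'col2', 'col3']
df = spark.createDataFrame([(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)], cols)

# Row-wise mean over the listed columns
df = df.withColumn('mean_value', sum(func.col(c) for c in cols) / len(cols))
# Build an array of |value - mean| per row, then take the array's maximum
df = df.withColumn('max_diff_val',
                   func.array_max(func.array(*[func.abs(func.col(c) - func.col('mean_value')) for c in cols])))
df.show()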
CodePudding user response:
Have you tried UDFs?
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
import numpy as np

@udf(returnType=FloatType())
def udf_mean(col1, col2, col3):
    # Cast to a plain Python float so Spark can serialize the result
    return float(np.mean([col1, col2, col3]))

df = df.withColumn("mean_value", udf_mean("col1", "col2", "col3"))
You can write a similar UDF for the max difference value; a sketch follows below.
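A minimal sketch of that second UDF, reusing the imports above and assuming the mean_value column has already been added; the name udf_max_diff is just illustrative:

@udf(returnType=FloatType())
def udf_max_diff(col1, col2, col3, mean_value):
    # Maximum absolute difference between the row mean and each column value
    return float(np.max(np.abs(np.array([col1, col2, col3]) - mean_value)))

df = df.withColumn("max_diff_value", udf_max_diff("col1", "col2", "col3", "mean_value"))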