I'm looking for a way to do the following calculation without converting a Spark DataFrame to a pandas DataFrame:
mean = sum(df[A]*df[B])/sum(df[B])
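This is just the B-weighted average of A. For a quick worked example (numbers of my own choosing): with A = [1, 2, 3] and B = [3, 2, 1], the result is (1*3 + 2*2 + 3*1) / (3 + 2 + 1) = 10/6 ≈ 1.67.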
A calculation based on selected columns of a Spark DataFrame can be done by splitting it into pieces, like:
new_col = df[A]*df[B]
new_col = sum(new_col)
new_col2 = sum(df[B])
mean = new_col/new_col2
But I hope there is a more elegant way to do this, perhaps using Spark's .withColumn function?
CodePudding user response:
You can create a new column with the product and then aggregate:
from pyspark.sql import functions as F
df = spark.createDataFrame(data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]], schema=["A", "B", "C"])
mean = (
    df
    .withColumn("AB", F.col("A") * F.col("B"))      # product column A*B
    .groupBy()                                      # aggregate over the whole DataFrame
    .agg(F.mean("AB").alias("AB"), F.mean("B").alias("B"))
    .withColumn("mean", F.col("AB") / F.col("B"))   # mean(A*B)/mean(B) == sum(A*B)/sum(B)
)
mean.show()
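If you only need the number itself, the whole thing can also be written as a single aggregation with F.sum, which mirrors the formula from the question directly (a minimal sketch assuming the same df and import as above):

weighted_mean = (
    df
    .agg((F.sum(F.col("A") * F.col("B")) / F.sum("B")).alias("mean"))  # sum(A*B) / sum(B)
    .first()["mean"]
)
print(weighted_mean)  # 1.0 for the sample data above

Both versions give the same result, since mean(AB)/mean(B) and sum(AB)/sum(B) only differ by a row count that cancels out.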