I'm looking for a way to do the following calculation without converting a Spark DataFrame to a pandas DataFrame:
mean = sum(df[A]*df[B])/sum(df[B])
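This is just the B-weighted average of A. For a quick worked example (numbers of my own choosing): with A = [1, 2, 3] and B = [3, 2, 1], the result is (1*3 + 2*2 + 3*1) / (3 + 2 + 1) = 10/6 ≈ 1.67.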
A calculation based on selected columns of a Spark DataFrame can be done by splitting it into pieces, like:
new_col = df[A]*df[B]
new_col = sum(new_col)
new_col2 = sum(df[B])
mean = new_col/new_col2
But I hope there is a more elegant way to do this, perhaps using Spark's .withColumn function?
CodePudding user response:
You can create a new column with the product and then aggregate:
from pyspark.sql import functions as F
df = spark.createDataFrame(data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]], schema=["A", "B", "C"])
mean = (
    df
    .withColumn("AB", F.col("A") * F.col("B"))      # product column A*B
    .groupBy()                                      # aggregate over the whole DataFrame
    .agg(F.mean("AB").alias("AB"), F.mean("B").alias("B"))
    .withColumn("mean", F.col("AB") / F.col("B"))   # mean(A*B)/mean(B) == sum(A*B)/sum(B)
)
mean.show()
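If you only need the number itself, the whole thing can also be written as a single aggregation with F.sum, which mirrors the formula from the question directly (a minimal sketch assuming the same df and import as above):

weighted_mean = (
    df
    .agg((F.sum(F.col("A") * F.col("B")) / F.sum("B")).alias("mean"))  # sum(A*B) / sum(B)
    .first()["mean"]
)
print(weighted_mean)  # 1.0 for the sample data above

Both versions give the same result, since mean(AB)/mean(B) and sum(AB)/sum(B) only differ by a row count that cancels out.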