How to agg a column to get standard dev in pyspark


I have a data frame that I group by Date and name, aggregating the average of price:

 df.filter("some filter")
    .withColumn("price_int", df.price.cast('integer'))
    .groupBy("Date", 'name')
    .agg(avg(col('price_int')))

I also want to get the max and stddev of the price_int column:

 df.filter("some filter")
    .withColumn("price_int", df.price.cast('integer'))
    .groupBy("Date", 'name')
    .agg(avg(col('price_int'),
         max(col('price_int'),
         stddev(col('price_int'),
    )))

But when I add max(col('price_int')), I get an error saying "Column is not iterable", and when I add stddev(col('price_int')), I get an error saying "stddev is not defined".

Can you please tell me how I can get the max and standard deviation of the column 'price_int'?

CodePudding user response:

I'm not sure where the issue is exactly, since the code you posted has its syntax messed up (the parentheses don't close where they should), but in any case the following works:

from pyspark.sql import functions as F

 (df.filter("some filter")
    .withColumn("price_int", df.price.cast('integer'))
    .groupBy("Date", 'name')
    .agg(F.avg(F.col('price_int')).alias("Average"),
         F.stddev(F.col("price_int")).alias("Std_Deviation"),
         F.max(F.col("price_int")).alias("Maximum")))
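
As a side note on the original errors: most likely Python's builtin max() was being applied to a Column because pyspark.sql.functions.max was never imported (hence "Column is not iterable"), and stddev was never imported at all. Below is a minimal sketch, assuming the column names from your question (price, Date, name), of how it would look if you import the functions by name instead of using the F. prefix:

    # Sketch only: importing the aggregate functions by name. Note that importing
    # max this way shadows the Python builtin, which is exactly why many people
    # prefer the F. prefix above.
    from pyspark.sql.functions import avg, col, max, stddev

    df.filter("some filter") \
        .withColumn("price_int", df.price.cast('integer')) \
        .groupBy("Date", 'name') \
        .agg(avg(col('price_int')).alias("Average"),
             max(col('price_int')).alias("Maximum"),
             stddev(col('price_int')).alias("Std_Deviation"))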

Worked for me:

(input/output screenshot)

Also, note that Spark has two different standard deviation functions (a short example follows the list):

stddev() or stddev_samp() - returns the unbiased sample standard deviation of the expression in a group

stddev_pop() - returns population standard deviation of the expression in a group.
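
For illustration, here is a small sketch on a made-up toy DataFrame (not the asker's data) showing that stddev() behaves like stddev_samp() and differs from stddev_pop():

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    # Toy data: three prices for one name (values made up for illustration)
    toy = spark.createDataFrame([("a", 1), ("a", 2), ("a", 4)], ["name", "price_int"])

    toy.groupBy("name").agg(
        F.stddev("price_int").alias("stddev"),            # sample stddev (divides by n - 1)
        F.stddev_samp("price_int").alias("stddev_samp"),  # same result as stddev
        F.stddev_pop("price_int").alias("stddev_pop"),    # population stddev (divides by n)
    ).show()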
