PySpark sum of column recomputed?


I have a pyspark dataframe like:

A | B
-----
1 | 2
2 | 3
3 | 4 

(values totally arbitrary)

I want a dataframe that is exactly like that, but with an additional column containing SUM(B), i.e. the overall sum of column B without any grouping. The result would be:

A | B | SUM(B)
--------------
1 | 2 | 9
2 | 3 | 9
3 | 4 | 9

How can this be achieved? I tried something like:

from pyspark.sql.functions import col, sum

df = df.alias('df')  # alias the dataframe as 'df'
df = df.select(
    'df.*',
    sum(col('df.B'))
)

But this does not work, since it complains that I am doing a sum without grouping. Is this somehow possible in PySpark?
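
For reference, a minimal sketch to reproduce the dataframe above, assuming a local SparkSession:

from pyspark.sql import SparkSession

# Build the toy dataframe from the question
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (2, 3), (3, 4)], ["A", "B"])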

CodePudding user response:

Use sum over a Window:

from pyspark.sql import Window
import pyspark.sql.functions as F 

# Ordering by a constant makes every row a peer, so the default RANGE
# frame spans the whole (single) partition and sums all of B.
df.withColumn(
    "SUM_B",
    F.sum(F.col("B")).over(Window.orderBy(F.lit(None)))
).show()

#+---+---+-----+
#|  A|  B|SUM_B|
#+---+---+-----+
#|  1|  2|    9|
#|  2|  3|    9|
#|  3|  4|    9|
#+---+---+-----+
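
As a side note, an equivalent spelling (a sketch, not from the original answer) is an unpartitioned window via an empty partitionBy(); with no ORDER BY, the frame defaults to the entire partition, so every row gets the overall total. It still moves everything to a single partition, the same caveat raised in the next answer.

from pyspark.sql import Window
import pyspark.sql.functions as F

# An empty partitionBy() spans the whole dataframe, so the sum on every
# row is the overall total of B.
df.withColumn("SUM_B", F.sum("B").over(Window.partitionBy())).show()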

CodePudding user response:

The obvious approach would be to use a window function like this:

from pyspark.sql import Window
import pyspark.sql.functions as f

win = Window.orderBy(f.lit(1))
df.withColumn("SUM(B)", f.sum("B").over(win)).show()

Yet, you would obtain the following warning:

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

It means that using a window function without partitioning causes the entire dataframe to be sent to a single executor, which is exactly what we want to avoid with Spark, for obvious performance reasons.

A better solution, which does not involve sending all the data to one executor, is to compute the sum first and then add it to the dataframe with lit, like this:

# Compute the total once (a single scalar is collected to the driver),
# then attach it to every row as a literal column.
sum_b = df.select(f.sum("B")).first()[0]
df.withColumn("SUM(B)", f.lit(sum_b)).show()