I have a PySpark dataframe like:
A | B
-----
1 | 2
2 | 3
3 | 4
(values totally arbitrary)
I want a dataframe that is exactly like this one but with an additional column SUM(B): the overall sum of column B, without any grouping. The result would be:
A | B | SUM(B)
--------------
1 | 2 | 9
2 | 3 | 9
3 | 4 | 9
How can this be achieved? I tried something like:
from pyspark.sql.functions import col, sum

df = df.alias('df')  # alias the dataframe as 'df'
df = df.select(
    'df.*',
    sum(col('df.B'))
)
But this does not work: it complains that I am using an aggregate function without grouping. Is there some way to do this in PySpark?
CodePudding user response:
Use sum over a Window:
from pyspark.sql import Window
import pyspark.sql.functions as F
df.withColumn(
    "SUM_B",
    F.sum(F.col("B")).over(Window.orderBy(F.lit(None)))
).show()
# +---+---+-----+
# |  A|  B|SUM_B|
# +---+---+-----+
# |  1|  2|    9|
# |  2|  3|    9|
# |  3|  4|    9|
# +---+---+-----+
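An equivalent spelling, as a minimal sketch assuming the same df, is a window with an empty partitionBy(). It gives the same result as the orderBy(F.lit(None)) trick and, like it, still moves all rows into a single partition (see the warning discussed in the next answer):

from pyspark.sql import Window
import pyspark.sql.functions as F

# Empty partition spec: one global window covering every row.
# Spark will warn about moving all data to a single partition.
df.withColumn("SUM_B", F.sum("B").over(Window.partitionBy())).show()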
CodePudding user response:
The obvious approach would be to use a window function like this:
from pyspark.sql import Window
import pyspark.sql.functions as f

win = Window.orderBy(f.lit(1))
df.withColumn("SUM(B)", f.sum("B").over(win)).show()
Yet, you would obtain the following warning:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
It means that using a window function without partitioning causes the entire dataframe to be sent to a single executor, which is exactly what we want to avoid with Spark, for obvious performance reasons.
A better solution, which does not involve sending all the data to one executor, is to compute the sum first and then add it to the dataframe with lit, like this:

import pyspark.sql.functions as f

sum_b = df.select(f.sum("B")).first()[0]
df.withColumn("SUM(B)", f.lit(sum_b)).show()
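If you would rather keep everything lazy in a single plan (the two-step version above triggers an extra job for first()), one sketch, not from the original answers, is to aggregate into a one-row dataframe and cross join it back onto the original; the broadcast hint tells Spark to replicate the tiny aggregate instead of shuffling:

import pyspark.sql.functions as f

# One-row dataframe holding the overall sum of B.
sum_df = df.agg(f.sum("B").alias("SUM(B)"))

# Cross join the single-row aggregate onto every row; broadcast() marks
# the tiny side for replication so no full shuffle is needed.
df.crossJoin(f.broadcast(sum_df)).show()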