Create multiple columns over the same window

Time:05-12

The following code is pretty slow.
Is there a way of creating multiple columns at once over the same window, so Spark does not need to partition and order the data multiple times?

w = Window().partitionBy("k").orderBy("t")

df = df.withColumn("a", F.last("a", True).over(w))
df = df.withColumn("b", F.last("b", True).over(w))
df = df.withColumn("c", F.last("c", True).over(w))
...

CodePudding user response:

I'm not sure that Spark partitions and reorders the data several times here, since you use the same window for every column. However, a single .select is usually a better alternative to a chain of .withColumn calls.

df = df.select(
    *[c for c in df.columns if c not in ("a", "b", "c")],
    F.last("a", True).over(w).alias("a"),
    F.last("b", True).over(w).alias("b"),
    F.last("c", True).over(w).alias("c"),
)

Note that the original columns are excluded from the select: combining "*" with aliases "a", "b", "c" would produce duplicate column names, making later references to them ambiguous.

To find out whether partitioning and ordering are done several times, analyse the output of df.explain().
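As background for reading that plan: with the default window frame (unboundedPreceding to currentRow), F.last(col, True) over a partition ordered by "t" is effectively a per-group forward fill. A minimal pure-Python sketch of that semantics, with an illustrative helper name and made-up sample data (not part of any Spark API):

```python
def last_ignorenulls(rows, key, order, col):
    """Forward-fill `col` within each `key` group, ordered by `order`.

    Mirrors F.last(col, ignorenulls=True).over(w) with the default frame:
    each row gets the most recent non-null value seen so far in its partition.
    """
    out = []
    last_seen = {}  # partition key -> last non-null value observed
    for row in sorted(rows, key=lambda r: (r[key], r[order])):
        if row[col] is not None:
            last_seen[row[key]] = row[col]
        filled = dict(row)
        filled[col] = last_seen.get(row[key])
        out.append(filled)
    return out

rows = [
    {"k": 1, "t": 1, "a": 10},
    {"k": 1, "t": 2, "a": None},  # filled with 10
    {"k": 1, "t": 3, "a": 30},
    {"k": 2, "t": 1, "a": None},  # no prior value in this partition -> stays None
    {"k": 2, "t": 2, "a": 20},
]
filled = last_ignorenulls(rows, "k", "t", "a")
print([r["a"] for r in filled])  # -> [10, 10, 30, None, 20]
```

Since every column uses the same partitioning and ordering, one pass per partition is enough in principle, which is why reusing a single window is the right approach.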

CodePudding user response:

You don't have to generate one column at a time. Use a list comprehension. Code below:

new = ['a', 'b', 'c']
df = df.select(
    *[c for c in df.columns if c not in new],
    *[F.last(x, True).over(w).alias(x) for x in new],
)

As above, the original columns are excluded so that the aliased window expressions do not produce duplicate column names.