The following code is pretty slow.
Is there a way of creating multiple columns at once over the same window, so Spark does not need to partition and order the data multiple times?
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("k").orderBy("t")
df = df.withColumn("a", F.last("a", True).over(w))
df = df.withColumn("b", F.last("b", True).over(w))
df = df.withColumn("c", F.last("c", True).over(w))
...
CodePudding user response:
I'm not sure that Spark actually partitions and sorts the data several times here, since you use the same window consecutively; Catalyst can collapse adjacent window expressions over the same window spec into a single Window operator. Still, .select is usually a better alternative than .withColumn, because each .withColumn call adds another projection (and analysis pass) to the plan, while a single .select builds one projection:
df = df.select(
    # exclude the original a/b/c columns; selecting "*" together with the
    # aliased window columns would create duplicate column names
    *[c for c in df.columns if c not in ("a", "b", "c")],
    F.last("a", True).over(w).alias("a"),
    F.last("b", True).over(w).alias("b"),
    F.last("c", True).over(w).alias("c"),
)
To find out whether partitioning and sorting really happen several times, analyse the df.explain() output.
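A minimal sketch of what to look for (the exact plan text varies by Spark version):

df.explain()
# In the physical plan you want to see a single Window operator, preceded by
# one Exchange hashpartitioning(k, ...) and one Sort [t ASC ...].
# Repeated Exchange/Sort pairs would mean the data really is partitioned
# and ordered more than once.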
CodePudding user response:
You don't have to generate one column at a time; use a list comprehension. Code below:
new = ["a", "b", "c"]
df = df.select(
    # exclude the originals so the aliased window columns don't duplicate them
    *[c for c in df.columns if c not in new],
    *[F.last(x, True).over(w).alias(x) for x in new],
)
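If you are on Spark 3.3 or later, DataFrame.withColumns (note the plural) is another option: it takes a dict and adds or replaces all the columns in one projection, so you avoid both the repeated withColumn calls and the duplicate-name problem. A sketch, reusing the names from the question:

new = ["a", "b", "c"]
df = df.withColumns({x: F.last(x, True).over(w) for x in new})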