Spark/Pyspark Efficiency of Union Window vs Join Window dropDuplicates


I have a task that involves a window aggregation across 2 tables/Spark dataframes, i.e., for each input date t from table 2, looking back and aggregating the rows of table 1 that fall within the last x days (t-x to t).

I have thought of 2 different methods to tackle this issue, to either:

  1. Carry out an outer join between the 2 dataframes (producing A rows) before the window aggregation
  2. Carry out a unionByName between the 2 dataframes (producing B rows, where B is always > A, often by ~1.5 - 1.75x) before the window aggregation (both approaches are sketched below)
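
For concreteness, here is a minimal sketch of the two approaches. The schemas (an id key, event_date and amount in table 1, a query date t in table 2) and the x-day look-back are assumptions for illustration, not the actual tables:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: table 1 holds events, table 2 holds the query dates t.
events = spark.createDataFrame(
    [(1, "2021-06-01", 10.0), (1, "2021-06-03", 20.0), (2, "2021-06-02", 5.0)],
    ["id", "event_date", "amount"],
).withColumn("event_date", F.to_date("event_date"))

queries = spark.createDataFrame(
    [(1, "2021-06-04"), (2, "2021-06-04")], ["id", "t"]
).withColumn("t", F.to_date("t"))

x = 7  # assumed look-back in days

# Method 1: outer join, window-aggregate per (id, t), then dropDuplicates.
w1 = Window.partitionBy("id", "t")
method1 = (
    queries.join(events, "id", "outer")
    .withColumn(
        "amount_last_x_days",
        F.sum(
            F.when(F.datediff("t", "event_date").between(0, x), F.col("amount"))
        ).over(w1),
    )
    .select("id", "t", "amount_last_x_days")
    .dropDuplicates(["id", "t"])
)

# Method 2: unionByName both frames, then one range-based window over days.
events_u = events.withColumn("is_query", F.lit(False))
queries_u = (
    queries.withColumnRenamed("t", "event_date")
    .withColumn("amount", F.lit(0.0))
    .withColumn("is_query", F.lit(True))
)
unioned = events_u.unionByName(queries_u.select(events_u.columns))

w2 = (
    Window.partitionBy("id")
    .orderBy(F.datediff("event_date", F.lit("1970-01-01").cast("date")))
    .rangeBetween(-x, 0)
)
method2 = (
    unioned.withColumn("amount_last_x_days", F.sum("amount").over(w2))
    .where(F.col("is_query"))
    .select("id", F.col("event_date").alias("t"), "amount_last_x_days")
)
```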

Also, I'm currently using a workaround for the allowMissingColumns parameter of unionByName, as my Spark version has not been updated to ≥ 3.1.0. The workaround creates the columns in each dataframe that are missing from the other before carrying out unionByName (based on this post).
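
For reference, a common form of that workaround is a small helper (a sketch, assuming both inputs are ordinary dataframes) that adds each side's missing columns as typed null literals before calling unionByName:

```python
from pyspark.sql import DataFrame, functions as F

def union_by_name_allow_missing(df_a: DataFrame, df_b: DataFrame) -> DataFrame:
    """Emulate unionByName(..., allowMissingColumns=True) on Spark < 3.1
    by adding any column missing from one side as a null of the other side's type."""
    for col in set(df_b.columns) - set(df_a.columns):
        df_a = df_a.withColumn(col, F.lit(None).cast(df_b.schema[col].dataType))
    for col in set(df_a.columns) - set(df_b.columns):
        df_b = df_b.withColumn(col, F.lit(None).cast(df_a.schema[col].dataType))
    return df_a.unionByName(df_b)
```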

I currently understand that join operations in Spark are generally less efficient than unions, as a join typically has to shuffle and sort the data before combining, whereas a union simply concatenates the datasets. However, in my scenario the union method outputs a larger dataframe than the join method (B > A), so would the window aggregation over the larger dataset negate the benefit of the faster union, such that the join method ends up faster overall (in terms of scalability)?

**P.S.** The join method also includes an additional dropDuplicates step on one of the columns after the window aggregation.

CodePudding user response:

Your concerns are legitimate, but the answer depends on your data size and structure. Only by trying out both methods will you know for sure which is better in your case; different situations will give different answers.

At first glance, I would say that "Union Window" would be better: the other one ("Join Window dropDuplicates") contains 2 additional resource-intensive transformations, join and dropDuplicates.

Also, pay attention to whether your windows are well partitioned. If the bigger (unioned) dataframe does not create skewed windows (some containing very few rows while others are huge), then you seem safe choosing "Union Window".
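
A quick way to check that (a sketch, assuming the window is partitioned by a key column such as id and reusing the unioned dataframe from the sketch in the question) is to count rows per partition key; a handful of keys with counts orders of magnitude above the rest would indicate skewed windows:

```python
from pyspark.sql import functions as F

# Row count per window-partition key, largest first.
unioned.groupBy("id").count().orderBy(F.desc("count")).show(20)
```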
