I have a (very) large dataset partitioned by year, month, and day. The partition columns were derived from an updated_at column during ingestion. Here is what it looks like:
| id | user | updated_at | year | month | day |
|---|---|---|---|---|---|
| 1 | a | 1992-01-19 | 1992 | 1 | 19 |
| 2 | c | 1992-01-20 | 1992 | 1 | 20 |
| 3 | a | 1992-01-21 | 1992 | 1 | 21 |
| ... | ... | ... | ... | ... | ... |
| 720987 | c | 2012-07-20 | 2012 | 7 | 20 |
| 720988 | a | 2012-07-21 | 2012 | 7 | 21 |
| ... | ... | ... | ... | ... | ... |
I need to use Apache Spark to find the 5th earliest event per user.
A simple window function like the one below is not an option: I am on a shared cluster and, given the size of the dataset, I will never have enough resources at any given time to do the whole computation in memory.
```python
from pyspark.sql import Window, functions as F

window = Window.partitionBy("user").orderBy(F.asc("updated_at"))
result = (df.withColumn("rank", F.dense_rank().over(window))
            .filter(F.col("rank") == 5))
```
I am considering looping through the partitions, processing and persisting intermediate results to disk, and then merging them back together (rough sketch below). How would you solve this? Thanks!
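Here is the sketch. It is untested, and the input path, scratch path, and year range are placeholders for my real setup:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # placeholder input path
w = Window.partitionBy("user").orderBy(F.asc("updated_at"))

# One pass per year; partition pruning keeps each pass small.
for y in range(1992, 2013):  # placeholder bounds taken from the sample rows
    (df.where(F.col("year") == y)
       .withColumn("rank", F.dense_rank().over(w))
       .filter(F.col("rank") <= 5)  # keep each user's 5 earliest distinct dates
       .drop("rank")
       .write.mode("append").parquet("/tmp/candidates"))  # placeholder scratch path

# The global 5th earliest date per user must be among some year's candidates,
# and the merged candidate set is small enough to rank in one go.
candidates = spark.read.parquet("/tmp/candidates")
result = (candidates.withColumn("rank", F.dense_rank().over(w))
                    .filter(F.col("rank") == 5))
```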
Answer:
I think the window below will be faster, because the data is physically partitioned by these columns and Spark can benefit from that locality when sorting.
```python
window = Window.partitionBy("user").orderBy(F.asc("year"), F.asc("month"), F.asc("day"))
```
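For completeness, plugged into the pipeline from the question (reusing its df and dense_rank filter; untested), that would be:

```python
from pyspark.sql import Window, functions as F

window = Window.partitionBy("user").orderBy(
    F.asc("year"), F.asc("month"), F.asc("day"))

# Same dense_rank filter as in the question, just sorting on the partition columns.
result = (df.withColumn("rank", F.dense_rank().over(window))
            .filter(F.col("rank") == 5))
```

Note this ranks identically to ordering by updated_at only because updated_at has day granularity in the sample; if it carried a time component, ties within a day could change which row ranks 5th.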