Find nth row per group in a large dataset with Spark

Time:12-19

I have a (very) large dataset partitioned by year, month and day. The partition columns were derived from an updated_at column during ingestion. Here is what it looks like:

id user updated_at year month day
1 a 1992-01-19 1992 1 19
2 c 1992-01-20 1992 1 20
3 a 1992-01-21 1992 1 21
... ... ... ... ... ...
720987 c 2012-07-20 2012 7 20
720988 a 2012-07-21 2012 7 21
... ... ... ... ... ...

I need to use Apache Spark to find the 5th earliest event per user.

A simple window function like the one below won't work: I'm on a shared cluster, and given the size of the dataset I won't have enough resources for the in-memory processing it requires at any given time.

window = Window.partitionBy("user").orderBy(F.asc("updated_at"))

df.withColumn("rank", F.dense_rank().over(window)) \
  .filter(F.col("rank") == 5)

I am considering looping through partitions, processing and persisting data to disk, and then merging them back. How would you solve it? Thanks!

CodePudding user response:

I think the code below will be faster, because the data is physically partitioned by these columns and Spark can benefit from data locality:

window = Window.partitionBy("user").orderBy(F.asc("year"), F.asc("month"), F.asc("day"))