Home > Mobile >  Get the most consecutive day from Date column with PySpark
Get the most consecutive day from Date column with PySpark

Time:01-05

Original dataframe:

member_id AccessDate
111111 2020-02-03
111111 2022-03-05
222222 2015-03-04
333333 2021-11-23
333333 2021-11-24
333333 2021-11-25
333333 2022-10-11
333333 2022-10-12
333333 2022-10-13
333333 2022-07-07
444444 2019-01-21
444444 2019-04-21
444444 2019-04-22
444444 2019-04-23
444444 2019-04-24
444444 2019-05-05
444444 2019-05-06
444444 2019-05-07

Result dataframe:

member_id Most_Consecutive_AccessDate total
111111 2022-03-05 1
222222 2015-03-04 1
333333 2022-10-11, 2022-10-12, 2022-10-13 3
444444 2019-04-21, 2019-04-22, 2019-04-23, 2019-04-24 4

Describe my question:

  • Each member has a unique id and Accessdate as the first table.
  • If the Member don't have consecutive day, must get the latest date.
  • If the Member have consecutive day more than 1 period, must get the most consecutive streak.
  • If the Member have the same number of consecutive day, must get the latest consecutive streak.
    • Example from original table member 333333 have consecutive period as 2021-11-23, 2021-11-24, 2021-11-25 and 2022-10-11, 2022-10-12, 2022-10-13. We need only the last period.

I have tried to use it.

df = df.withColumn('Most_Consecutive_AccessDate', collect_list('AccessDate')

try to get the number in the list But it is too far from the expected result

ps. edit for easier reading and wrong typing

CodePudding user response:

This would work: (Using "gap-and-islands" - will add details shortly)

df.select(F.col("member_id"), F.col("AccessDate").cast("Date"))\
.withColumn("row_num", F.row_number().over(Window().orderBy(F.lit('A'))))\
.withColumn("bucket", F.col("AccessDate")-F.col("row_num"))\
.groupBy("bucket", "member_id")\
.agg(F.max("AccessDate").alias("AccessDate"), F.count("*").alias("cnt"))\
.orderBy(F.col("cnt").desc(), F.col("AccessDate").desc())\
.groupBy("member_id")\
.agg(F.first("cnt"), F.first("AccessDate"))\
.show();

To know more and learn about series problems - https://blog.jooq.org/how-to-find-the-longest-consecutive-series-of-events-in-sql/

  • Related