Spark - how to check if dates are not consecutive


I have a dataframe with start_time (timestamp) and ev_date (int) columns. I'm trying to figure out whether the dates in these two columns are consecutive; I only need to check consecutiveness within each individual column. I'm thinking of using the lag function but am not sure how to implement it. How can I find the missing dates? Many thanks.

Input:

ev_date |  start_time
---------------------
20220301| 2022-02-28 10:09:21.356782917
20220301| 2022-02-28 03:09:21.756982919
20220302| 2022-03-01 03:09:21.756982919
20220303| 2022-03-02 03:09:21.756982919
20220305| 2022-03-02 03:09:21.756982919 --ev_date is not right here as 20020304 is missing 
20220306| 2022-03-06 03:09:21.756982919 --start_time is not right as it jumped from 03-02 to 03-06

 

CodePudding user response:

You can add a new column with the difference to the previous row's date, then filter for rows where the diff is bigger than 1 day. Something like the following (Python, but similar for Scala - not sure which language you need from the tags):

from pyspark.sql.functions import col, to_date, datediff, lag
from pyspark.sql.window import Window

# ev_date is an int like 20220301, so parse it to a date before comparing;
# add .partitionBy(...) to the window if you have a grouping column
w = Window.orderBy("ev_date")
df1 = (df
    .withColumn("ev_dt", to_date(col("ev_date").cast("string"), "yyyyMMdd"))
    .withColumn("diff", datediff(col("ev_dt"), lag(col("ev_dt"), 1).over(w))))
df1.filter(col("diff") > 1).show()
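To make the lag/datediff logic concrete outside Spark, here is a plain-Python sketch of the same idea (the helper name find_gaps is my own, not from the post); for the start_time column you would first truncate each timestamp to its date:

```python
from datetime import date

def find_gaps(dates):
    """Return (previous, current) pairs where more than one day was skipped."""
    dates = sorted(set(dates))
    return [(prev, cur) for prev, cur in zip(dates, dates[1:])
            if (cur - prev).days > 1]

# ev_date values from the example, parsed from yyyyMMdd ints
ev_dates = [date(2022, 3, 1), date(2022, 3, 2), date(2022, 3, 3),
            date(2022, 3, 5), date(2022, 3, 6)]
print(find_gaps(ev_dates))  # [(datetime.date(2022, 3, 3), datetime.date(2022, 3, 5))]
```

This flags the 20220303 to 20220305 jump, matching the missing 20220304 noted in the input table.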