I have a dataframe with a start_time (timestamp) column and an ev_date (int, yyyyMMdd) column. I'm trying to figure out whether the dates in these two columns are consecutive. I only need to check whether the dates are consecutive within each individual column. I'm thinking of using the lag function but I'm not sure how to implement it. How can I find the missing dates? Many thanks.
Input:
ev_date | start_time
---------------------
20220301| 2022-02-28 10:09:21.356782917
20220301| 2022-02-28 03:09:21.756982919
20220302| 2022-03-01 03:09:21.756982919
20220303| 2022-03-02 03:09:21.756982919
20220305| 2022-03-02 03:09:21.756982919 --ev_date is not right here as 20220304 is missing
20220306| 2022-03-06 03:09:21.756982919 --start_time is not right as it jumped from 03-02 to 03-06
CodePudding user response:
You can add a new column with the difference from the previous row's date, then filter for rows where the diff is bigger than 1 day. Something like the following (Python, but similar for Scala -- not sure which language you need from the tags):
from pyspark.sql.functions import col, to_date, datediff, lag
from pyspark.sql.window import Window

# ev_date is an int like 20220301, so cast it to a string and parse it as a date
# before calling datediff
w = Window.orderBy("ev_date")  # add .partitionBy(...) here if you have a grouping column
df1 = (df
       .withColumn("ev_dt", to_date(col("ev_date").cast("string"), "yyyyMMdd"))
       .withColumn("diff", datediff(col("ev_dt"), lag("ev_dt", 1).over(w))))
df1.filter(df1["diff"] > 1).show()
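To see what the lag + datediff > 1 computation does, here is a minimal plain-Python sketch of the same gap check, run over hypothetical sample rows mirroring the question's input (the `rows` data and the `gaps` helper are illustrative, not part of any Spark API):

```python
from datetime import datetime

# Hypothetical sample mirroring the question's data:
# ev_date as an int (yyyyMMdd), start_time as a timestamp string.
rows = [
    (20220301, "2022-02-28 10:09:21"),
    (20220301, "2022-02-28 03:09:21"),
    (20220302, "2022-03-01 03:09:21"),
    (20220303, "2022-03-02 03:09:21"),
    (20220305, "2022-03-02 03:09:21"),  # ev_date gap: 20220304 missing
    (20220306, "2022-03-06 03:09:21"),  # start_time gap: jumps 03-02 -> 03-06
]

def gaps(dates):
    """Return (previous, current) pairs where current is more than one day
    after previous -- the plain-Python equivalent of lag + datediff > 1."""
    out = []
    prev = None
    for d in sorted(set(dates)):  # sorted() plays the role of orderBy
        if prev is not None and (d - prev).days > 1:
            out.append((prev, d))
        prev = d  # prev plays the role of lag(..., 1)
    return out

ev_dates = [datetime.strptime(str(r[0]), "%Y%m%d").date() for r in rows]
start_dates = [datetime.strptime(r[1], "%Y-%m-%d %H:%M:%S").date() for r in rows]

print(gaps(ev_dates))     # gap between 2022-03-03 and 2022-03-05
print(gaps(start_dates))  # gap between 2022-03-02 and 2022-03-06
```

Each column is checked independently, which matches the requirement of finding gaps within each individual column rather than comparing the two columns to each other.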