Date filtering in PySpark using between


I have a Spark dataframe with date columns. The last entry in "date1col" is today, and the last entry in "date2col" is from 10 days ago. I need to filter each column for the last two weeks, up to yesterday.

I used df.filter(col('date1col').between(current_date()-1, current_date()-15)), and it worked fine. However, when I used the same syntax on the second date column, i.e. df.filter(col('date2col').between(current_date()-1, current_date()-15)), it returned an empty sdf. When I used df.filter(col('startdate') > current_date()-15), it worked. But my dataframe is dynamic, meaning it updates daily at 9 am. How can I make the between function return the same sdf as the > logic?
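
A minimal sketch that reproduces the behaviour, assuming a toy dataframe with a date2col column (made-up data, not the actual dataframe):

# Minimal reproduction on hypothetical toy data
from datetime import date, timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date

spark = SparkSession.builder.getOrCreate()

# date2col runs from 30 days ago up to 10 days ago
rows = [(date.today() - timedelta(days=d),) for d in range(10, 31)]
df = spark.createDataFrame(rows, ['date2col'])

# Bounds given as (upper, lower): the condition becomes
# date2col >= current_date()-1 AND date2col <= current_date()-15, which no row can satisfy
print(df.filter(col('date2col').between(current_date() - 1,
                                        current_date() - 15)).count())  # 0

# The plain comparison keeps the expected rows
print(df.filter(col('date2col') > current_date() - 15).count())  # 5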

CodePudding user response:

Switch the order of the bounds:
first argument - the earlier (lower) date
second argument - the later (upper) date

df.filter(col('date2col').between(current_date()-15, current_date()-1))

The two orderings are not the same, which can be shown using sameSemantics:

df1 = df.filter(col('date2col').between(current_date()-15, current_date()-1))
df2 = df.filter(col('date2col').between(current_date()-1, current_date()-15))
df1.sameSemantics(df2)
# False

If you still need it, this is .between translated to ">= / <=" logic:

df.filter(col('date2col').between(current_date()-15, current_date()-1))
=
df.filter((col('date2col') >= current_date()-15) & (col('date2col') <= current_date()-1))

df1 = df.filter(col('date2col').between(current_date()-15, current_date()-1))
df2 = df.filter((col('date2col') >= current_date()-15) & (col('date2col') <= current_date()-1))

print(df1.sameSemantics(df2)) # `True` when the logical query plans are equal, thus same results
# True

"<>" translated to .between

df.filter(col('date2col') > current_date()-15)
= df.filter(col('date2col').between(current_date()-14, '9999'))

Here the sameSemantics result would be False, but for any practical case the results would be the same.
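
A quick sanity check on toy data (such as the df sketched in the question; exceptAll just compares the row sets):

# The logical plans differ, but the selected rows are identical
df1 = df.filter(col('date2col') > current_date() - 15)
df2 = df.filter(col('date2col').between(current_date() - 14, '9999'))

print(df1.sameSemantics(df2))  # False - different logical plans
print(df1.exceptAll(df2).count(), df2.exceptAll(df1).count())  # 0 0 - same rows either way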
