I have a Spark dataframe with date columns. The last entry in "date1col" is today, and the last entry in "date2col" is from 10 days ago. I need to filter the dates for the last two weeks, up to yesterday.
I used
df.filter(col('date1col').between(current_date()-1, current_date()-15))
and it worked fine. However, when I used the same syntax for the second date column, i.e.
df.filter(col('date2col').between(current_date()-1, current_date()-15))
it returned an empty sdf. When I used
df.filter(col('startdate') > current_date()-15)
it worked. But my dataframe is dynamic, meaning it updates daily at 9am. How can I force the between function to return the same sdf as the > logic?
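A minimal sketch that reproduces this, assuming a local SparkSession and toy data whose date2col ends 10 days ago (the column name and offsets are illustrative):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, expr

spark = SparkSession.builder.getOrCreate()

# Toy data: date2col runs from 20 days ago up to 10 days ago.
df = spark.range(10, 21).withColumn(
    'date2col', expr('date_sub(current_date(), cast(id as int))')
)

# Reversed bounds: the lower bound is later than the upper bound, so nothing matches.
df.filter(col('date2col').between(current_date()-1, current_date()-15)).show()

# A plain > comparison keeps the recent rows as expected.
df.filter(col('date2col') > current_date()-15).show()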
CodePudding user response:
Switch the order of the bounds: between expects the earlier date first and the later date second.
df.filter(col('date2col').between(current_date()-15, current_date()-1))
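Applied to both of the question's columns (keeping the question's 15-day lower bound, and assuming df already holds date-typed columns), the corrected filter would look like:
df.filter(col('date1col').between(current_date()-15, current_date()-1))
df.filter(col('date2col').between(current_date()-15, current_date()-1))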
The two orderings are not the same, which can be shown using sameSemantics:
df1 = df.filter(col('date2col').between(current_date()-15, current_date()-1))
df2 = df.filter(col('date2col').between(current_date()-1, current_date()-15))
df1.sameSemantics(df2)
# False
If you still need it, here is .between translated to >= / <= logic:
df.filter(col('date2col').between(current_date()-15, current_date()-1))
is equivalent to
df.filter((col('date2col') >= current_date()-15) & (col('date2col') <= current_date()-1))
df1 = df.filter(col('date2col').between(current_date()-15, current_date()-1))
df2 = df.filter((col('date2col') >= current_date()-15) & (col('date2col') <= current_date()-1))
print(df1.sameSemantics(df2)) # `True` when the logical query plans are equal, thus same results
# True
"<>" translated to .between
df.filter(col('date2col') > current_date()-15)
= df.filter(col('date2col').between(current_date()-14, '9999'))
Here the sameSemantics result would be False, but for any practical case the results are the same.
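A sketch that checks this practical equivalence on toy data like the reproduction above (the '9999' upper bound is the trick from this answer):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, expr

spark = SparkSession.builder.getOrCreate()

# Toy data: date2col runs from 20 days ago up to 10 days ago.
df = spark.range(10, 21).withColumn(
    'date2col', expr('date_sub(current_date(), cast(id as int))')
)

df_gt = df.filter(col('date2col') > current_date()-15)
df_btw = df.filter(col('date2col').between(current_date()-14, '9999'))

print(df_gt.sameSemantics(df_btw))  # expected False: the logical plans differ

# The returned rows are still identical for date-typed data.
print(df_gt.exceptAll(df_btw).count() == 0 and df_btw.exceptAll(df_gt).count() == 0)
# True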