I'm trying to understand the behaviour differences between pyspark.sql.functions.current_timestamp() and datetime.now().
If I create a Spark dataframe in Databricks using these two mechanisms to create a timestamp column, everything works as expected:
from datetime import date, datetime
from pyspark.sql import functions as F

curDate2 = spark.range(10)\
    .withColumn("current_date_lit", F.lit(date.today()))\
    .withColumn("current_timestamp_lit", F.lit(F.current_timestamp()))\
    .withColumn("current_timestamp", F.current_timestamp())\
    .withColumn("now", F.lit(datetime.now()))
+---+----------------+---------------------+--------------------+--------------------+
| id|current_date_lit|current_timestamp_lit|   current_timestamp|                 now|
+---+----------------+---------------------+--------------------+--------------------+
|  0|      2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
|  1|      2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
|  2|      2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
+---+----------------+---------------------+--------------------+--------------------+
However, when I call show() on the dataframe a couple of minutes later, the columns based on current_timestamp() show the time now (16:44), whilst the datetime.now() column shows the timestamp from when the dataframe was first created (16:40).
Clearly one column holds a literal value and the other evaluates the function at runtime, but I'm at a loss to understand why they behave differently.
show() a few mins later...
+---+----------------+---------------------+--------------------+--------------------+
| id|current_date_lit|current_timestamp_lit|   current_timestamp|                 now|
+---+----------------+---------------------+--------------------+--------------------+
|  0|      2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
|  1|      2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
|  2|      2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
+---+----------------+---------------------+--------------------+--------------------+
Thanks - I hope this makes sense!
CodePudding user response:
current_timestamp() returns a TimestampType column, the value of which is evaluated at query time, as described in the docs. So it is 'computed' each time you call show.
Returns the current timestamp at the start of query evaluation as a TimestampType column. All calls of current_timestamp within the same query return the same value.
- Passing this column to a lit call doesn't change anything. If you check the source code, you can see that lit simply returns the column you called it with.
return col if isinstance(col, Column) else _invoke_function("lit", col)
- If you call lit with something other than a column, e.g. a datetime object, then a new column is created with this literal value. The literal is the datetime object returned from datetime.now(), a static value representing the time the datetime.now function was called.
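The two bullet points above can be illustrated without a cluster. What follows is only a toy model in plain Python, not Spark's implementation: the Column class, fake_now, and show are hypothetical stand-ins, and the clock is faked so that each read advances by one minute. The one piece taken from the answer is the shape of lit, which returns a column unchanged and wraps anything else as a constant.

```python
from datetime import datetime, timedelta
from itertools import count

# Fake clock (assumption for the demo): every read advances one minute,
# so the results are deterministic.
_ticks = count()
_START = datetime(2022, 2, 12, 16, 40)

def fake_now():
    return _START + timedelta(minutes=next(_ticks))

class Column:
    """Toy stand-in for a Spark column: an expression evaluated per action."""
    def __init__(self, evaluate):
        self.evaluate = evaluate

def current_timestamp():
    # Deferred: the clock is read only when an action evaluates the column.
    return Column(lambda: fake_now())

def lit(value):
    # Same shape as the quoted source line: a Column is returned unchanged,
    # anything else is captured as a constant.
    return value if isinstance(value, Column) else Column(lambda: value)

def show(columns):
    # Toy "action": evaluate every column expression now.
    return {name: col.evaluate() for name, col in columns.items()}

columns = {
    "current_timestamp_lit": lit(current_timestamp()),
    "now": lit(fake_now()),  # fake_now() runs here, once, at definition time
}
first = show(columns)
second = show(columns)
```

In this sketch the "now" entry is identical in first and second, while "current_timestamp_lit" moves forward between the two calls, which is the pattern seen in the question. Unlike real Spark, the toy model does not try to reproduce the rule that all current_timestamp calls within one query share a single value.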
CodePudding user response:
Good question. I tried it out with the rand() function just to check. The behaviour is sort of intuitive, but at the same time an Action without a prior .cache() applied to the data would lead one to believe that a new round gives a new set of results.
show() is an Action with some smarts. Here it is based on the same underlying RDD, and logically one would expect a deterministic outcome, at least I think so. However, F.current_timestamp() is evaluated once per query, at the start of query evaluation, so two successive show() calls will have two different times. The other answer states that and points to the docs. That is an exception, so I also tried with rand(); see below.
datetime.now() is held constant: it is called once in Python when the DF is defined, and lit embeds the returned value as a constant in the plan, so it stays the same for as long as the same underlying DF is used. I did a check with rand(), and all successive show() Actions return the same sequence of random numbers, since the same seed is used. This emulates the deterministic behaviour we would want with two successive show() calls.
With a new DF of the same name, everything is re-evaluated, obviously.
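The rand() observation can be mimicked in plain Python. This is a hedged sketch of the idea only, assuming (as the check above suggests) that the seed is picked once when the column is defined and then reused by every action. The name rand_column is hypothetical and the code does not call Spark.

```python
import random

def rand_column(seed=None):
    """Toy stand-in for F.rand(): the seed is fixed when the column is defined."""
    fixed_seed = random.getrandbits(32) if seed is None else seed

    def evaluate(num_rows):
        # Each "action" restarts the generator from the same fixed seed.
        rng = random.Random(fixed_seed)
        return [rng.random() for _ in range(num_rows)]

    return evaluate

same_df = rand_column()
first_show = same_df(3)
second_show = same_df(3)   # same definition, same seed, same numbers

new_df = rand_column()     # a new definition picks a new seed
third_show = new_df(3)
```

Here first_show and second_show are equal because they replay the same seed, whereas third_show comes from a fresh definition and will almost always differ.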
You can try and see what happens if you use .cache().
It is a contrived example with range(10), of course.