How to use a column's name in an expr() instead of direct value for add date in Pyspark?

Time:10-07

I want to add the value of a column (add_value) which is based on seconds, to a timestamp column (start_date).

+-----------------------+----------+
|             start_date| add_value|
+-----------------------+----------+
|2022-09-05 00:00:01.394|     6.001|
+-----------------------+----------+

Using the code below with a literal value works fine:

(col('start_date') + expr('INTERVAL 6.001 seconds')).alias("end_date")

But replacing the literal with the column's name raises an error:

(col('start_date') + expr('INTERVAL add_value seconds')).alias("end_date")

error:

Syntax error at or near 'seconds': extra input 'seconds'(line 1, pos 22)

== SQL ==
INTERVAL add_value seconds
----------------------^^^

How should I use the column's name?

CodePudding user response:

To create intervals from an existing column, concatenate the string 'interval ' || add_value || ' seconds' and cast the resulting string to INTERVAL type. Then you can add it to the timestamp column start_date:

from pyspark.sql import functions as F

df.withColumn(
    'end_date',
    F.col('start_date') + F.format_string('interval %s seconds', 'add_value').cast('interval')
).show()

# +-----------------------+---------+-----------------------+
# |start_date             |add_value|end_date               |
# +-----------------------+---------+-----------------------+
# |2022-09-05 00:00:01.394|6.001    |2022-09-05 00:00:07.395|
# +-----------------------+---------+-----------------------+
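The expected result can be sanity-checked outside Spark. A minimal plain-Python sketch (standard library only, not Spark) of adding 6.001 seconds to the sample timestamp:

```python
from datetime import datetime, timedelta

# Parse the sample timestamp and add the decimal seconds as a timedelta.
start = datetime.strptime('2022-09-05 00:00:01.394', '%Y-%m-%d %H:%M:%S.%f')
end = start + timedelta(seconds=6.001)
print(end)  # 2022-09-05 00:00:07.395000
```

This matches the end_date shown in the output above.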

CodePudding user response:

An INTERVAL literal doesn't accept values from a column. An easy alternative is to cast the timestamp to double, add the decimal seconds, and cast the resulting double back to timestamp to get the desired format.

from pyspark.sql import functions as func

spark.sparkContext.parallelize([('2022-09-05 00:00:01.394', 6.001)]).toDF(['start_date', 'add_value']). \
    withColumn('start_date', func.to_timestamp('start_date')). \
    withColumn('start_date_added_value',
               (func.col('start_date').cast('double') + func.col('add_value')).cast('timestamp')
               ). \
    show(truncate=False)

# +-----------------------+---------+-----------------------+
# |start_date             |add_value|start_date_added_value |
# +-----------------------+---------+-----------------------+
# |2022-09-05 00:00:01.394|6.001    |2022-09-05 00:00:07.395|
# +-----------------------+---------+-----------------------+
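Under the hood, this works because a timestamp cast to double is its epoch-seconds value, so the addition is ordinary floating-point arithmetic. A plain-Python sketch of the same round trip (standard datetime module, not Spark):

```python
from datetime import datetime, timezone

# timestamp -> double: epoch seconds as a float
start = datetime(2022, 9, 5, 0, 0, 1, 394000, tzinfo=timezone.utc)
epoch = start.timestamp()

# add the decimal seconds, then double -> timestamp
result = datetime.fromtimestamp(epoch + 6.001, tz=timezone.utc)
print(result)  # roughly 2022-09-05 00:00:07.395; float rounding may shift the last microsecond
```

Note that going through a double limits precision to well below a microsecond for current epoch values, which is fine here but worth keeping in mind for exact comparisons.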