Suppose you have a PySpark DataFrame with an integer column days
and a timestamp column start
that looks like this:
start                days
--------------------------------------
2021-04-08 02:49:09  8
2021-04-08 02:49:09  9
2020-05-05 08:43:02  7
What I want is a new column end
that would look like this:
start                days  end
---------------------------------------------------------
2021-04-08 02:49:09  8     2021-04-16 02:49:09
2021-04-08 02:49:09  9     2021-04-17 02:49:09
2020-05-05 08:43:02  7     2020-05-12 08:43:02
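For reference, the arithmetic being asked for is a plain whole-day offset on a timestamp, with the time of day preserved. Outside Spark, the expected values can be sanity-checked with Python's datetime module (this snippet only illustrates the expected rows; it is not part of any Spark solution):

```python
from datetime import datetime, timedelta

rows = [("2021-04-08 02:49:09", 8), ("2021-04-08 02:49:09", 9), ("2020-05-05 08:43:02", 7)]
for start_str, days in rows:
    start = datetime.strptime(start_str, "%Y-%m-%d %H:%M:%S")
    end = start + timedelta(days=days)  # shift by whole days; time of day is preserved
    print(start_str, days, end.strftime("%Y-%m-%d %H:%M:%S"))
```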
CodePudding user response:
You can convert days into an interval and then add it to the start column.
from pyspark.sql import functions as F
data = [("2021-04-08 02:49:09", 8), ("2021-04-08 02:49:09", 9), ("2020-05-05 08:43:02", 7)]
df = spark.createDataFrame(data, ("start", "days")).withColumn(
    "start", F.to_timestamp("start", "yyyy-MM-dd HH:mm:ss")
)

# Build a string like "8 days", cast it to an interval, and add it to start
df.withColumn(
    "end", F.col("start") + F.concat(F.col("days"), F.lit(" days")).cast("interval")
).show()
Output:
+-------------------+----+-------------------+
|              start|days|                end|
+-------------------+----+-------------------+
|2021-04-08 02:49:09|   8|2021-04-16 02:49:09|
|2021-04-08 02:49:09|   9|2021-04-17 02:49:09|
|2020-05-05 08:43:02|   7|2020-05-12 08:43:02|
+-------------------+----+-------------------+
CodePudding user response:
Use the date_add
function to add the days, then concatenate the time part of start to the resulting date:
from pyspark.sql import functions as F
df2 = df.withColumn(
    "end",
    F.concat(
        F.expr("date_add(start, days)"),
        F.date_format("start", " HH:mm:ss")
    ).cast("timestamp")
)
df2.show(truncate=False)

# +-------------------+----+-------------------+
# |start              |days|end                |
# +-------------------+----+-------------------+
# |2021-04-08 02:49:09|8   |2021-04-16 02:49:09|
# |2021-04-08 02:49:09|9   |2021-04-17 02:49:09|
# |2020-05-05 08:43:02|7   |2020-05-12 08:43:02|
# +-------------------+----+-------------------+
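To see what the concat of date_add plus the formatted time part produces per row, here is a pure-Python mirror of the same logic (illustrative only; the helper name end_from_start is made up and is not a Spark API):

```python
from datetime import datetime, timedelta

def end_from_start(start_str, days):
    # Mirror of the SQL expression: date_add shifts the date part by `days`,
    # then the original " HH:mm:ss" time part is appended back.
    start = datetime.strptime(start_str, "%Y-%m-%d %H:%M:%S")
    shifted_date = (start + timedelta(days=days)).date()
    return "{} {}".format(shifted_date, start_str[11:])

print(end_from_start("2021-04-08 02:49:09", 8))  # 2021-04-16 02:49:09
```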
CodePudding user response:
It's actually easy to do this with dates, but keeping the time component is harder. A SQL expression lets the two columns play together. Here I use lots of whitespace for formatting (runnable code below).
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [
        ("2021-04-08 02:49:09", 8),
        ("2021-04-08 02:49:09", 9),
        ("2020-05-05 08:43:02", 7),
    ],
    [
        "start",
        "days",
    ],
)
df.withColumn(
    'end',
    F.to_timestamp(
        F.expr("date_add(start, days) || ' ' || split(start, ' ')[1]")
    )
).show()
+-------------------+----+-------------------+
|              start|days|                end|
+-------------------+----+-------------------+
|2021-04-08 02:49:09|   8|2021-04-16 02:49:09|
|2021-04-08 02:49:09|   9|2021-04-17 02:49:09|
|2020-05-05 08:43:02|   7|2020-05-12 08:43:02|
+-------------------+----+-------------------+
Code in runnable format:
df = spark.createDataFrame([("2021-04-08 02:49:09",8),("2021-04-08 02:49:09",9),("2020-05-05 08:43:02",7)],["start","days"])
import pyspark.sql.functions as F
df.withColumn('end',F.to_timestamp(F.expr("date_add(start,days) || ' ' ||split(start,' ')[1]"))).show()