Suppose you have a PySpark DataFrame with an integer column days
and a timestamp column start
that looks like this:
start                days
--------------------------------------
2021-04-08 02:49:09  8
2021-04-08 02:49:09  9
2020-05-05 08:43:02  7
What I want is a new column end
that would look like this:
start                days  end
---------------------------------------------------------
2021-04-08 02:49:09  8     2021-04-16 02:49:09
2021-04-08 02:49:09  9     2021-04-17 02:49:09
2020-05-05 08:43:02  7     2020-05-12 08:43:02
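For reference, the arithmetic being asked for is a plain whole-day offset on a timestamp, with the time of day preserved. Outside Spark, the expected values can be sanity-checked with Python's datetime module (this snippet only illustrates the expected rows; it is not part of any Spark solution):

```python
from datetime import datetime, timedelta

rows = [("2021-04-08 02:49:09", 8), ("2021-04-08 02:49:09", 9), ("2020-05-05 08:43:02", 7)]
for start_str, days in rows:
    start = datetime.strptime(start_str, "%Y-%m-%d %H:%M:%S")
    end = start + timedelta(days=days)  # shift by whole days; time of day is preserved
    print(start_str, days, end.strftime("%Y-%m-%d %H:%M:%S"))
```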
CodePudding user response:
You can convert days into an interval and then add it to the start column.
from pyspark.sql import functions as F
data = [("2021-04-08 02:49:09", 8), ("2021-04-08 02:49:09", 9), ("2020-05-05 08:43:02", 7)]
df = spark.createDataFrame(data, ("start", "days")).withColumn(
    "start", F.to_timestamp("start", "yyyy-MM-dd HH:mm:ss")
)

# Build a string like "8 days", cast it to an interval, and add it to start
df.withColumn(
    "end", F.col("start") + F.concat(F.col("days"), F.lit(" days")).cast("interval")
).show()
Output:
+-------------------+----+-------------------+
|              start|days|                end|
+-------------------+----+-------------------+
|2021-04-08 02:49:09|   8|2021-04-16 02:49:09|
|2021-04-08 02:49:09|   9|2021-04-17 02:49:09|
|2020-05-05 08:43:02|   7|2020-05-12 08:43:02|
+-------------------+----+-------------------+
CodePudding user response:
Use the date_add
function to add the days, then concatenate the time part of start to the resulting date:
from pyspark.sql import functions as F
df2 = df.withColumn(
    "end",
    F.concat(
        F.expr("date_add(start, days)"),
        F.date_format("start", " HH:mm:ss")
    ).cast("timestamp")
)
df2.show(truncate=False)

# +-------------------+----+-------------------+
# |start              |days|end                |
# +-------------------+----+-------------------+
# |2021-04-08 02:49:09|8   |2021-04-16 02:49:09|
# |2021-04-08 02:49:09|9   |2021-04-17 02:49:09|
# |2020-05-05 08:43:02|7   |2020-05-12 08:43:02|
# +-------------------+----+-------------------+
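To see what the concat of date_add plus the formatted time part produces per row, here is a pure-Python mirror of the same logic (illustrative only; the helper name end_from_start is made up and is not a Spark API):

```python
from datetime import datetime, timedelta

def end_from_start(start_str, days):
    # Mirror of the SQL expression: date_add shifts the date part by `days`,
    # then the original " HH:mm:ss" time part is appended back.
    start = datetime.strptime(start_str, "%Y-%m-%d %H:%M:%S")
    shifted_date = (start + timedelta(days=days)).date()
    return "{} {}".format(shifted_date, start_str[11:])

print(end_from_start("2021-04-08 02:49:09", 8))  # 2021-04-16 02:49:09
```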
CodePudding user response:
It's actually easy to do this with dates, but keeping the time component is harder. A SQL expression lets the two columns play together. Here I use lots of whitespace for formatting (runnable code below).
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [
        ("2021-04-08 02:49:09", 8),
        ("2021-04-08 02:49:09", 9),
        ("2020-05-05 08:43:02", 7),
    ],
    [
        "start",
        "days",
    ],
)
df.withColumn(
    'end',
    F.to_timestamp(
        F.expr("date_add(start, days) || ' ' || split(start, ' ')[1]")
    )
).show()
+-------------------+----+-------------------+
|              start|days|                end|
+-------------------+----+-------------------+
|2021-04-08 02:49:09|   8|2021-04-16 02:49:09|
|2021-04-08 02:49:09|   9|2021-04-17 02:49:09|
|2020-05-05 08:43:02|   7|2020-05-12 08:43:02|
+-------------------+----+-------------------+
Code in runnable format:
df = spark.createDataFrame([("2021-04-08 02:49:09",8),("2021-04-08 02:49:09",9),("2020-05-05 08:43:02",7)],["start","days"])
import pyspark.sql.functions as F
df.withColumn('end',F.to_timestamp(F.expr("date_add(start,days) || ' ' ||split(start,' ')[1]"))).show()