I need to subtract two datetime columns to get a time-elapsed column. I was able to combine the separate date and time columns into two combined columns called pickup and dropoff, but I cannot get these columns into a datetime type; below, 'pickup' and 'dropoff' are strings. Is there a way to convert these columns to a datetime type?
I've been struggling since the time does not include am/pm. The PySpark dataframe is shown below. Thanks!
df.show()
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
|pickup_date|pickup_time|dropoff_date|dropoff_time|distance| tip| fare|       pickup|      dropoff|
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
|   1/1/2017|       0:00|    1/1/2017|        0:00|    0.02|   0| 52.8|1/1/2017 0:00|1/1/2017 0:00|
|   1/1/2017|       0:00|    1/1/2017|        0:03|     0.5|   0|  5.3|1/1/2017 0:00|1/1/2017 0:03|
|   1/1/2017|       0:00|    1/1/2017|        0:39|    7.75|4.66|27.96|1/1/2017 0:00|1/1/2017 0:39|
|   1/1/2017|       0:00|    1/1/2017|        0:06|     0.8|1.45| 8.75|1/1/2017 0:00|1/1/2017 0:06|
|   1/1/2017|       0:00|    1/1/2017|        0:08|     0.9|   0|  8.3|1/1/2017 0:00|1/1/2017 0:08|
|   1/1/2017|       0:00|    1/1/2017|        0:05|    1.76|   0|  8.3|1/1/2017 0:00|1/1/2017 0:05|
|   1/1/2017|       0:00|    1/1/2017|        0:15|    8.47|7.71|38.55|1/1/2017 0:00|1/1/2017 0:15|
|   1/1/2017|       0:00|    1/1/2017|        0:11|     2.4|   0| 11.8|1/1/2017 0:00|1/1/2017 0:11|
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
CodePudding user response:
Convert the string columns to timestamp type with `to_timestamp`, then subtract. The 24-hour pattern letter `H` parses times like `0:35` without an am/pm marker.
Scala:
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.LongType
import spark.implicits._ // needed for toDF on a local Seq

val data = Seq(("1/1/2017 0:00", "1/1/2017 0:35"))
val df = data.toDF("pickup_dt", "drop_dt")

df
  .withColumn("pickup_dt", to_timestamp(col("pickup_dt"), "d/M/yyyy H:mm"))
  .withColumn("drop_dt", to_timestamp(col("drop_dt"), "d/M/yyyy H:mm"))
  // cast timestamps to epoch seconds and divide to get minutes elapsed
  .withColumn("diff", (col("drop_dt").cast(LongType) - col("pickup_dt").cast(LongType)) / 60)
  .show(false)
output:
+-------------------+-------------------+----+
|pickup_dt          |drop_dt            |diff|
+-------------------+-------------------+----+
|2017-01-01 00:00:00|2017-01-01 00:35:00|35.0|
+-------------------+-------------------+----+
Pyspark (the string columns must be converted with `to_timestamp` first, otherwise casting them to `long` yields nulls):
from pyspark.sql.functions import col, to_timestamp

df = (
    df.withColumn("pickup_dt", to_timestamp(col("pickup_dt"), "d/M/yyyy H:mm"))
      .withColumn("drop_dt", to_timestamp(col("drop_dt"), "d/M/yyyy H:mm"))
      # cast timestamps to epoch seconds and divide to get minutes elapsed
      .withColumn(
          "diff",
          (col("drop_dt").cast("long") - col("pickup_dt").cast("long")) / 60.0,
      )
)
df.show(truncate=False)
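As a quick sanity check that a 24-hour pattern needs no am/pm marker, here is a plain-Python sketch of the same subtraction (no Spark needed). Note the pattern syntaxes differ: Spark's `d/M/yyyy H:mm` corresponds roughly to `strptime`'s `%d/%m/%Y %H:%M`; the sample strings mirror the `pickup`/`dropoff` values above.

```python
from datetime import datetime

# %H is the 24-hour clock, so "0:35" parses with no am/pm marker;
# strptime also accepts non-zero-padded day/month/hour fields.
FMT = "%d/%m/%Y %H:%M"

pickup = datetime.strptime("1/1/2017 0:00", FMT)
dropoff = datetime.strptime("1/1/2017 0:35", FMT)

# Elapsed minutes, matching the Spark cast-to-long-and-divide approach
diff_minutes = (dropoff - pickup).total_seconds() / 60
print(diff_minutes)  # 35.0
```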