Combining date / time string columns in pyspark to get one datetime column


I need to subtract datetimes to get an elapsed-time column. I was able to combine the separate date and time columns into two combined columns called pickup and dropoff, but I cannot get these columns into a datetime type; below, 'pickup' and 'dropoff' are strings. Is there a way to convert them to a datetime type?

I've been struggling because the times don't include am/pm. The PySpark dataframe is shown below, and a sketch of the combine step follows the table for reference. Thanks!

df.show()

+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
|pickup_date|pickup_time|dropoff_date|dropoff_time|distance| tip| fare|       pickup|      dropoff|
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
|   1/1/2017|       0:00|    1/1/2017|        0:00|    0.02|   0| 52.8|1/1/2017 0:00|1/1/2017 0:00|
|   1/1/2017|       0:00|    1/1/2017|        0:03|     0.5|   0|  5.3|1/1/2017 0:00|1/1/2017 0:03|
|   1/1/2017|       0:00|    1/1/2017|        0:39|    7.75|4.66|27.96|1/1/2017 0:00|1/1/2017 0:39|
|   1/1/2017|       0:00|    1/1/2017|        0:06|     0.8|1.45| 8.75|1/1/2017 0:00|1/1/2017 0:06|
|   1/1/2017|       0:00|    1/1/2017|        0:08|     0.9|   0|  8.3|1/1/2017 0:00|1/1/2017 0:08|
|   1/1/2017|       0:00|    1/1/2017|        0:05|    1.76|   0|  8.3|1/1/2017 0:00|1/1/2017 0:05|
|   1/1/2017|       0:00|    1/1/2017|        0:15|    8.47|7.71|38.55|1/1/2017 0:00|1/1/2017 0:15|
|   1/1/2017|       0:00|    1/1/2017|        0:11|     2.4|   0| 11.8|1/1/2017 0:00|1/1/2017 0:11|
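For reference, the combined pickup / dropoff strings above can be built from the separate date and time columns with something along these lines (a sketch using concat_ws; it assumes the four source columns are plain strings, which matches the output shown):

from pyspark.sql.functions import concat_ws

# Join the date and time strings with a space to form the combined
# `pickup` and `dropoff` columns shown in the table above.
df = (
    df.withColumn("pickup", concat_ws(" ", "pickup_date", "pickup_time"))
      .withColumn("dropoff", concat_ws(" ", "dropoff_date", "dropoff_time"))
)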

CodePudding user response:

Convert the string timestamps to the timestamp data type and subtract. Since the times are on a 24-hour clock, the pattern d/M/yyyy H:mm parses them without needing am/pm.

Scala:

import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.LongType
import spark.implicits._ // assumes an active SparkSession named `spark`

val data = Seq(("1/1/2017 0:00", "1/1/2017 0:35"))
val df = data.toDF("pickup_dt", "drop_dt")

df
  // parse the "d/M/yyyy H:mm" strings (24-hour clock) into timestamps
  .withColumn("pickup_dt", to_timestamp(col("pickup_dt"), "d/M/yyyy H:mm"))
  .withColumn("drop_dt", to_timestamp(col("drop_dt"), "d/M/yyyy H:mm"))
  // cast to epoch seconds, subtract, and divide by 60 to get minutes
  .withColumn("diff", (col("drop_dt").cast(LongType) - col("pickup_dt").cast(LongType)) / 60)
  .show(false)

output:

+-------------------+-------------------+----+
|pickup_dt          |drop_dt            |diff|
+-------------------+-------------------+----+
|2017-01-01 00:00:00|2017-01-01 00:35:00|35.0|
+-------------------+-------------------+----+

PySpark (same approach):

from pyspark.sql.functions import col, to_timestamp

df.withColumn("pickup_dt", to_timestamp(col("pickup_dt"), "d/M/yyyy H:mm")) \
  .withColumn("drop_dt", to_timestamp(col("drop_dt"), "d/M/yyyy H:mm")) \
  .withColumn(
      "diff",
      (col("drop_dt").cast("long") - col("pickup_dt").cast("long")) / 60.0
  ).show(truncate=False)
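Applied to the dataframe from the question, the same recipe converts the combined pickup and dropoff string columns and yields the elapsed-time column the poster is after. A minimal sketch, assuming df is the dataframe shown in the question and using elapsed_min as a made-up name for the new column:

from pyspark.sql.functions import col, to_timestamp

# `pickup` / `dropoff` are the combined string columns from the question;
# `elapsed_min` (hypothetical name) holds the trip duration in minutes.
df = (
    df.withColumn("pickup", to_timestamp(col("pickup"), "d/M/yyyy H:mm"))
      .withColumn("dropoff", to_timestamp(col("dropoff"), "d/M/yyyy H:mm"))
      .withColumn(
          "elapsed_min",
          (col("dropoff").cast("long") - col("pickup").cast("long")) / 60.0
      )
)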