SparkSQL - Difference between two time stamps in minutes irrespective of seconds


I am trying to convert the SQL version of some code to PySpark. One requirement is to get the difference in minutes between two timestamps.

Inputs:

+-------------------+-------------------+
|date_time_closed   |date_time_opened   |
+-------------------+-------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|
+-------------------+-------------------+

The required output is 5.

The SQL version of the code is:

avg(DATEDIFF(minute, date_time_opened, date_time_closed)) as avg_wait_time

Output: avg_wait_time: 5 (T-SQL's DATEDIFF(minute, ...) counts minute boundaries crossed, so the seconds are ignored)

Spark SQL: I have tried the Spark SQL snippets below to get the value, but they take the seconds into account as well.

  1. CAST(avg((unix_timestamp(date_time_closed)-unix_timestamp(date_time_opened))/(60)) as INT) as avg_wait_time
    

Output: avg_wait_time: 4 (the raw average is 4.4 minutes, and CAST ... as INT truncates towards zero)

  2. CAST(avg(((bigint(to_timestamp(date_time_closed)))-(bigint(to_timestamp(date_time_opened))))/(60))as INT) as avg_wait_time

Output: avg_wait_time: 4

Is there any way I can get 5, i.e. just the minute difference?

CodePudding user response:

You can ceil the result as below to round it up to the nearest integer value.

Data Preparation

from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

sql = SparkSession.builder.getOrCreate()

# tab-separated sample data matching the question's input
s = StringIO("""
date_time_closed\tdate_time_opened
2021-11-23 18:30:10\t2021-11-23 18:25:46
""")

df = pd.read_csv(s, delimiter='\t')

sparkDF = sql.createDataFrame(df)

sparkDF.show(truncate=False)

# parse the timestamp strings into epoch seconds
sparkDF = sparkDF.withColumn('date_time_closed_epoch',F.unix_timestamp(F.col('date_time_closed'),'yyyy-MM-dd HH:mm:ss'))\
                 .withColumn('date_time_opened_epoch',F.unix_timestamp(F.col('date_time_opened'),'yyyy-MM-dd HH:mm:ss'))

sparkDF.show()

+-------------------+-------------------+
|date_time_closed   |date_time_opened   |
+-------------------+-------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|
+-------------------+-------------------+

+-------------------+-------------------+----------------------+----------------------+
|   date_time_closed|   date_time_opened|date_time_closed_epoch|date_time_opened_epoch|
+-------------------+-------------------+----------------------+----------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|            1637672410|            1637672146|
+-------------------+-------------------+----------------------+----------------------+

Ceil

sparkDF = sparkDF.withColumn('avg_wait_time',  # raw difference in (fractional) minutes
                                (F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch'))/60
                            )\
                .withColumn('avg_wait_time_ceil',F.ceil(  # rounded up to whole minutes
                                (F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch'))/60)
                            )

sparkDF.show(truncate=False,vertical=True)

-RECORD 0-------------------------------------
 date_time_closed       | 2021-11-23 18:30:10 
 date_time_opened       | 2021-11-23 18:25:46 
 date_time_closed_epoch | 1637672410          
 date_time_opened_epoch | 1637672146          
 avg_wait_time          | 4.4                 
 avg_wait_time_ceil     | 5
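
Note that ceil rounds up any partial minute, which matches the required output here, but it is not quite the same as SQL's DATEDIFF(minute, ...), which counts minute boundaries crossed (e.g. 18:25:00 to 18:29:59 crosses 4 boundaries, yet the elapsed 299 s ceils to 5). If you need the exact DATEDIFF behaviour, here is a minimal sketch using date_trunc (which Spark also provides), reusing the columns above:

# A sketch: truncating both timestamps to the minute before differencing
# reproduces DATEDIFF(minute, ...) semantics, i.e. seconds are ignored.
sparkDF = sparkDF.withColumn('wait_time_minutes',
                    (F.unix_timestamp(F.date_trunc('minute', F.to_timestamp('date_time_closed')))
                     - F.unix_timestamp(F.date_trunc('minute', F.to_timestamp('date_time_opened')))) / 60
                 )

sparkDF.select('wait_time_minutes').show()  # 5.0 for the sample row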

CodePudding user response:

Don't mess around with fiddly epoch calculations. Just truncate the values to the minute and subtract those.

=> SELECT date_trunc('minute', current_timestamp);
       date_trunc       
────────────────────────
 2022-05-19 08:44:00+00
(1 row)
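
The prompt above is psql (Postgres), but Spark SQL has the same date_trunc function (unit as the first argument), so the same idea works for the original query. A minimal sketch, assuming the data is registered as a view named events (the name is made up for illustration) and spark is the SparkSession:

# A sketch, assuming a hypothetical view "events" with the question's columns:
# truncate to the minute, take the difference in seconds, divide by 60, average.
avg_df = spark.sql("""
    SELECT CAST(avg(
             (unix_timestamp(date_trunc('minute', to_timestamp(date_time_closed)))
              - unix_timestamp(date_trunc('minute', to_timestamp(date_time_opened)))) / 60
           ) AS INT) AS avg_wait_time
    FROM events
""")

avg_df.show()  # avg_wait_time: 5 for the sample row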