I am trying to convert a SQL version of some code to a PySpark version. The requirement is to get the difference in minutes between two timestamps.
Inputs:
+-------------------+-------------------+
|date_time_closed   |date_time_opened   |
+-------------------+-------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|
+-------------------+-------------------+
The required output is 5.
The SQL version of the code is:
avg(DATEDIFF(minute, date_time_opened, date_time_closed)) as avg_wait_time
Output: avg_wait_time: 5
Spark SQL: I have tried the Spark SQL snippets below to get the value, but they take the seconds into account as well.
CAST(avg((unix_timestamp(date_time_closed)-unix_timestamp(date_time_opened))/(60)) as INT) as avg_wait_time
Output: avg_wait_time: 4
CAST(avg(((bigint(to_timestamp(date_time_closed)))-(bigint(to_timestamp(date_time_opened))))/(60))as INT) as avg_wait_time
Output: avg_wait_time: 4
Is there any way I can get 5, i.e. just the minute difference?
CodePudding user response:
You can ceil the result as below to round it up to the next integer value.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

sql = SparkSession.builder.getOrCreate()

s = StringIO("""
date_time_closed\tdate_time_opened
2021-11-23 18:30:10\t2021-11-23 18:25:46
""")
df = pd.read_csv(s, delimiter='\t')
sparkDF = sql.createDataFrame(df)
sparkDF.show(truncate=False)
sparkDF = sparkDF.withColumn('date_time_closed_epoch',F.unix_timestamp(F.col('date_time_closed'),'yyyy-MM-dd HH:mm:ss'))\
.withColumn('date_time_opened_epoch',F.unix_timestamp(F.col('date_time_opened'),'yyyy-MM-dd HH:mm:ss'))
sparkDF.show()
+-------------------+-------------------+
|date_time_closed   |date_time_opened   |
+-------------------+-------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|
+-------------------+-------------------+
+-------------------+-------------------+----------------------+----------------------+
|   date_time_closed|   date_time_opened|date_time_closed_epoch|date_time_opened_epoch|
+-------------------+-------------------+----------------------+----------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|            1637672410|            1637672146|
+-------------------+-------------------+----------------------+----------------------+
Ceil
# Raw difference in minutes (fractional) alongside the value rounded up with F.ceil.
sparkDF = sparkDF.withColumn(
    'avg_wait_time',
    (F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch')) / 60
).withColumn(
    'avg_wait_time_ceil',
    F.ceil((F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch')) / 60)
)
sparkDF.show(truncate=False,vertical=True)
-RECORD 0--------------------------------------
 date_time_closed       | 2021-11-23 18:30:10
 date_time_opened       | 2021-11-23 18:25:46
 date_time_closed_epoch | 1637672410
 date_time_opened_epoch | 1637672146
 avg_wait_time          | 4.4
 avg_wait_time_ceil     | 5
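Since the original SQL averaged across rows, here is a minimal sketch of the final aggregation (assuming the epoch columns computed above; the ceil is applied per row, before the average):

# Sketch: ceil each row's wait to whole minutes, then average and cast to int.
result = sparkDF.agg(
    F.avg(
        F.ceil((F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch')) / 60)
    ).cast('int').alias('avg_wait_time')
)
result.show()

For the single sample row this yields avg_wait_time = 5.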
CodePudding user response:
Don't mess around with fiddly epoch calculations. DATEDIFF(minute, ...) counts minute boundaries crossed, so just tell it you want the values precise to the minute and subtract those: 18:25:46 and 18:30:10 truncate to 18:25:00 and 18:30:00, which are exactly 5 minutes apart.
=> SELECT date_trunc('minute', current_timestamp);
       date_trunc
────────────────────────
 2022-05-19 08:44:00+00
(1 row)
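The same idea works in PySpark. A sketch, assuming the column names from the question (F.date_trunc and F.to_timestamp are standard pyspark.sql.functions):

from pyspark.sql import functions as F

# Truncate both timestamps to the minute before subtracting, so only
# whole minute boundaries are counted, as DATEDIFF(minute, ...) does.
closed_min = F.date_trunc('minute', F.to_timestamp('date_time_closed'))
opened_min = F.date_trunc('minute', F.to_timestamp('date_time_opened'))

sparkDF = sparkDF.withColumn(
    'wait_minutes',
    (F.unix_timestamp(closed_min) - F.unix_timestamp(opened_min)) / 60
)

For the sample row this gives a 300-second difference, i.e. 5.0 minutes, and averaging wait_minutes reproduces the required avg_wait_time of 5.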