I need help converting the function below into an SQL query:
start_time = 1649289600
end_time = 1649375999
from pyspark.sql.functions import abs, ceil, col, from_unixtime, to_timestamp

# Convert the epoch-second columns to timestamps, then split the
# difference into hour / minute / second components.
test_data = df.withColumn("from_timestamp", to_timestamp(from_unixtime(col("start_time"), 'MM-dd-yyyy HH:mm:ss:SSS'), 'MM-dd-yyyy HH:mm:ss:SSS')) \
    .withColumn("to_timestamp", to_timestamp(from_unixtime(col("end_time"), 'MM-dd-yyyy HH:mm:ss:SSS'), 'MM-dd-yyyy HH:mm:ss:SSS')) \
    .withColumn("DiffInSeconds", col("from_timestamp").cast("long") - col("to_timestamp").cast("long")) \
    .withColumn("time_diff_hr", abs(ceil(col("DiffInSeconds") / 3600))) \
    .withColumn("minDiffInHours", col("DiffInSeconds") / 3600 - ceil(col("DiffInSeconds") / 3600)) \
    .withColumn("time_diff_min", abs(ceil(col("minDiffInHours") * 60))) \
    .withColumn("minDiffInSec", col("minDiffInHours") * 60 - ceil(col("minDiffInHours") * 60)) \
    .withColumn("time_diff_sec", abs(ceil(col("minDiffInSec") * 60)))
I have tried quite a few things like:
df2=sqlContext.sql("SELECT start,end,from_unixtime(cast(start as string), 'MM-dd-yyyy HH:mm:ss:SSS') AS start_date,from_unixtime(cast(end as string), 'MM-dd-yyyy HH:mm:ss:SSS') AS end_date from data1")
df2.show()
df2.createOrReplaceTempView('data2')
df4=sqlContext.sql("SELECT start,end,start_date,end_date,(end_date-start_date) as time_diff from data2")
But whenever I try to find the difference, it returns null values.
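I suspect the formatted strings are to blame; a minimal sketch of what I think happens (assuming Spark's default, non-ANSI casting, where a non-numeric string silently casts to null):
# start_date and end_date are strings, so (end_date - start_date) casts both
# sides to double, gets null on each side, and null - null is null.
sqlContext.sql("SELECT '04-07-2022 23:59:59' - '04-07-2022 00:00:00' AS time_diff").show()
# +---------+
# |time_diff|
# +---------+
# |     null|
# +---------+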
Edit:
Output for data1
+---+---+----+----------+----------+--------------------+--------------------+
|  a|  b|   c|     start|       end|          start_date|            end_date|
+---+---+----+----------+----------+--------------------+--------------------+
|  1|4.0|GFG1|1649289600|1649375999|04-07-2022 00:00:...|04-07-2022 23:59:...|
+---+---+----+----------+----------+--------------------+--------------------+
CodePudding user response:
In SQL's SELECT you can only reference existing columns. PySpark's withColumn works differently: it essentially creates a new DataFrame with an additional column, so later calls can reference that column. Since SQL cannot do this, you must build every new column from the original start and end columns:
df2 = sqlContext.sql(
    """
    select
        start,
        end,
        from_unixtime(start, 'yyyy-MM-dd HH:mm:ss') as from_timestamp,
        from_unixtime(end, 'yyyy-MM-dd HH:mm:ss') as to_timestamp,
        (start - end) as DiffInSeconds,
        abs(ceil((start - end) / 3600)) as time_diff_hr,
        ((start - end) / 3600) - ceil((start - end) / 3600) as minDiffInHours,
        abs(ceil((((start - end) / 3600) - ceil((start - end) / 3600)) * 60)) as time_diff_min,
        (((start - end) / 3600) - ceil((start - end) / 3600)) * 60 - ceil((((start - end) / 3600) - ceil((start - end) / 3600)) * 60) as minDiffInSec,
        abs(ceil(((((start - end) / 3600) - ceil((start - end) / 3600)) * 60 - ceil((((start - end) / 3600) - ceil((start - end) / 3600)) * 60)) * 60)) as time_diff_sec
    from data1
    """
)
df2.show()
# +----------+----------+-------------------+-------------------+-------------+------------+-------------------+-------------+-------------------+-------------+
# |     start|       end|     from_timestamp|       to_timestamp|DiffInSeconds|time_diff_hr|     minDiffInHours|time_diff_min|       minDiffInSec|time_diff_sec|
# +----------+----------+-------------------+-------------------+-------------+------------+-------------------+-------------+-------------------+-------------+
# |1649289600|1649375999|2022-04-07 00:00:00|2022-04-07 23:59:59|       -86399|          23|-0.9997222222222213|           59|-0.9833333333332774|           58|
# +----------+----------+-------------------+-------------------+-------------+------------+-------------------+-------------+-------------------+-------------+
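If the repeated (start - end) arithmetic bothers you, a subquery (or another temp view) lets the outer query reference aliases defined one level deeper; a minimal sketch of the same idea, assuming the data1 view from above (the alias t and the name df3 are mine):
# Same result as above, but DiffInSeconds is computed once in the inner
# query, so the outer expressions can simply reference it by name.
df3 = sqlContext.sql(
    """
    select
        start,
        end,
        from_timestamp,
        to_timestamp,
        DiffInSeconds,
        abs(ceil(DiffInSeconds / 3600)) as time_diff_hr
    from (
        select
            start,
            end,
            from_unixtime(start, 'yyyy-MM-dd HH:mm:ss') as from_timestamp,
            from_unixtime(end, 'yyyy-MM-dd HH:mm:ss') as to_timestamp,
            (start - end) as DiffInSeconds
        from data1
    ) t
    """
)
time_diff_min and time_diff_sec would chain the same way, each reusing the previous alias through one more nesting level.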
CodePudding user response:
If the requirement is only to convert into timestamps, see the logic below:
spark.sql("select timestamp(from_unixtime(1649289600,'yyyy-MM-dd HH:mm:ss')) as start_time, timestamp(from_unixtime(1649375999,'yyyy-MM-dd HH:mm:ss')) as end_time" ).show(truncate=False)
+-------------------+-------------------+
|start_time         |end_time           |
+-------------------+-------------------+
|2022-04-07 00:00:00|2022-04-07 23:59:59|
+-------------------+-------------------+
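And once both columns are real timestamps, the difference in seconds is just a cast to long; a quick sketch in the same style (the subquery alias t is mine):
# Casting a timestamp to long yields epoch seconds, so the subtraction
# below gives the gap in seconds (1649375999 - 1649289600 = 86399).
spark.sql(
    """
    select cast(end_time as long) - cast(start_time as long) as DiffInSeconds
    from (
        select
            timestamp(from_unixtime(1649289600, 'yyyy-MM-dd HH:mm:ss')) as start_time,
            timestamp(from_unixtime(1649375999, 'yyyy-MM-dd HH:mm:ss')) as end_time
    ) t
    """
).show()
# +-------------+
# |DiffInSeconds|
# +-------------+
# |        86399|
# +-------------+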