Converting PySpark's consecutive withColumn to SQL


I need help in converting the below function into an SQL query:

start_time = 1649289600
end_time = 1649375999

from pyspark.sql.functions import abs, ceil, col, from_unixtime, to_timestamp

test_data = df.withColumn("from_timestamp", to_timestamp(from_unixtime(col("start_time"), 'MM-dd-yyyy HH:mm:ss:SSS'), 'MM-dd-yyyy HH:mm:ss:SSS')) \
    .withColumn("to_timestamp", to_timestamp(from_unixtime(col("end_time"), 'MM-dd-yyyy HH:mm:ss:SSS'), 'MM-dd-yyyy HH:mm:ss:SSS')) \
    .withColumn("DiffInSeconds", col("from_timestamp").cast("long") - col("to_timestamp").cast("long")) \
    .withColumn("time_diff_hr", abs(ceil(col("DiffInSeconds") / 3600))) \
    .withColumn("minDiffInHours", col("DiffInSeconds") / 3600 - ceil(col("DiffInSeconds") / 3600)) \
    .withColumn("time_diff_min", abs(ceil(col("minDiffInHours") * 60))) \
    .withColumn("minDiffInSec", col("minDiffInHours") * 60 - ceil(col("minDiffInHours") * 60)) \
    .withColumn("time_diff_sec", abs(ceil(col("minDiffInSec") * 60)))

I have tried quite a few things like:

df2 = sqlContext.sql("""
    SELECT start, end,
           from_unixtime(cast(start as string), 'MM-dd-yyyy HH:mm:ss:SSS') AS start_date,
           from_unixtime(cast(end as string), 'MM-dd-yyyy HH:mm:ss:SSS') AS end_date
    FROM data1
""")
df2.show()
df2.createOrReplaceTempView('data2')
df4 = sqlContext.sql("SELECT start, end, start_date, end_date, (end_date - start_date) AS time_diff FROM data2")

But whenever I try to compute the difference, it returns null values.

Edit:

Output for data1

+---+---+----+----------+----------+--------------------+--------------------+
|  a|  b|   c|     start|       end|          start_date|            end_date|
+---+---+----+----------+----------+--------------------+--------------------+
|  1|4.0|GFG1|1649289600|1649375999|04-07-2022 00:00:...|04-07-2022 23:59:...|
+---+---+----+----------+----------+--------------------+--------------------+

CodePudding user response:

In SQL's SELECT you can only reference columns that already exist in the source; an alias defined in the same SELECT list is not visible to the other expressions. PySpark's withColumn works differently: each call produces a new DataFrame with the additional column, so later steps can use it. (That is also why your time_diff comes back null: start_date and end_date are formatted strings, and subtracting them makes Spark try to cast both to numbers, which fails for this format.) Since plain SQL cannot build on aliases that way, you must derive every new column directly from the start and end columns:

df2 = sqlContext.sql(
    """
    select
        start,
        end,
        from_unixtime(start, 'yyyy-MM-dd HH:mm:ss') as from_timestamp,
        from_unixtime(end, 'yyyy-MM-dd HH:mm:ss') as to_timestamp,
        (start - end) as DiffInSeconds,
        abs(ceil((start - end) / 3600)) as time_diff_hr,
        ((start - end) / 3600) - ceil((start - end) / 3600) as minDiffInHours,
        abs(ceil((((start - end) / 3600) - ceil((start - end) / 3600)) * 60)) as time_diff_min,
        (((start - end) / 3600) - ceil((start - end) / 3600)) * 60 - ceil((((start - end) / 3600) - ceil((start - end) / 3600)) * 60) as minDiffInSec,
        abs(ceil(((((start - end) / 3600) - ceil((start - end) / 3600)) * 60 - ceil((((start - end) / 3600) - ceil((start - end) / 3600)) * 60)) * 60)) as time_diff_sec
    from data1
    """
)
df2.show()
# +----------+----------+-------------------+-------------------+-------------+------------+-------------------+-------------+-------------------+-------------+
# |     start|       end|     from_timestamp|       to_timestamp|DiffInSeconds|time_diff_hr|     minDiffInHours|time_diff_min|       minDiffInSec|time_diff_sec|
# +----------+----------+-------------------+-------------------+-------------+------------+-------------------+-------------+-------------------+-------------+
# |1649289600|1649375999|2022-04-07 00:00:00|2022-04-07 23:59:59|       -86399|          23|-0.9997222222222213|           59|-0.9833333333332774|           58|
# +----------+----------+-------------------+-------------------+-------------+------------+-------------------+-------------+-------------------+-------------+
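If repeating those expressions bothers you, you can recover withColumn's step-by-step feel with chained CTEs, where each level is allowed to reference the aliases defined in the previous one. A minimal sketch against the same data1 view (the step1/step2/step3 names are just illustrative):

df3 = sqlContext.sql(
    """
    with step1 as (
        -- first "withColumn": the raw difference in seconds
        select start, end, (start - end) as DiffInSeconds
        from data1
    ),
    step2 as (
        -- this level can reference DiffInSeconds by name
        select *,
               abs(ceil(DiffInSeconds / 3600)) as time_diff_hr,
               DiffInSeconds / 3600 - ceil(DiffInSeconds / 3600) as minDiffInHours
        from step1
    ),
    step3 as (
        select *,
               abs(ceil(minDiffInHours * 60)) as time_diff_min,
               minDiffInHours * 60 - ceil(minDiffInHours * 60) as minDiffInSec
        from step2
    )
    select *, abs(ceil(minDiffInSec * 60)) as time_diff_sec
    from step3
    """
)
df3.show()

Each CTE plays the role of one withColumn call, and Spark's optimizer should collapse the nesting into a single projection, so the layering costs nothing at runtime.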

CodePudding user response:

If the requirement is only to convert the epoch values into timestamps, see the logic below:

spark.sql("select timestamp(from_unixtime(1649289600,'yyyy-MM-dd HH:mm:ss')) as start_time, timestamp(from_unixtime(1649375999,'yyyy-MM-dd HH:mm:ss')) as end_time" ).show(truncate=False)

+-------------------+-------------------+
|start_time         |end_time           |
+-------------------+-------------------+
|2022-04-07 00:00:00|2022-04-07 23:59:59|
+-------------------+-------------------+
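If you are on Spark 3.1 or later, timestamp_seconds does the same conversion directly from the epoch, skipping the round trip through a formatted string:

# requires Spark 3.1+, where timestamp_seconds is available in SQL
spark.sql("select timestamp_seconds(1649289600) as start_time, timestamp_seconds(1649375999) as end_time").show(truncate=False)

This prints the same two timestamps, interpreted in the session time zone.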