I'm trying to obtain the following:
+-----------+-----------+
| work_time | day_shift |
+-----------+-----------+
| 00:45:40  | No        |
| 10:05:47  | Yes       |
| 15:25:28  | Yes       |
| 19:38:52  | No        |
+-----------+-----------+
where I classify the "work_time" into "day_shift":
"Yes" - if the time falls between 09:00:00 and 18:00:00
"No" - otherwise
My "work_time" is in datetime format showing only the time. I tried the following, but I'm just getting "No" for everything.
df = df.withColumn(
    'day_shift',
    when(
        (df.work_time >= to_timestamp(lit('09:00:00'), 'HH:mm:ss'))
        & (df.work_time <= to_timestamp(lit('18:00:00'), 'HH:mm:ss')),
        'Yes'
    ).otherwise('No')
)
CodePudding user response:
You can use the Column class method between. It works for both timestamps and strings in the format "HH:mm:ss", since zero-padded time strings sort lexicographically in the same order as the times they represent. Use this:
F.col("work_time").between("09:00:00", "18:00:00")
Full test:
from pyspark.sql import functions as F

df = spark.createDataFrame([('00:45:40',), ('10:05:47',), ('15:25:28',), ('19:38:52',)], ['work_time'])

# between is inclusive on both ends: 09:00:00 and 18:00:00 both count as day shift
day_shift = F.col("work_time").between("09:00:00", "18:00:00")
df = df.withColumn("day_shift", F.when(day_shift, "Yes").otherwise("No"))
df.show()
# +---------+---------+
# |work_time|day_shift|
# +---------+---------+
# | 00:45:40|       No|
# | 10:05:47|      Yes|
# | 15:25:28|      Yes|
# | 19:38:52|       No|
# +---------+---------+
CodePudding user response:
First of all, Spark doesn't have a "Time" data type; it only supports TimestampType or DateType. Therefore, I believe the work_time in your dataframe is a string.
Secondly, when you check your func.to_timestamp(func.lit('09:00:00'), 'HH:mm:ss') in a select statement, it will show:
+--------------------------------+
|to_timestamp(09:00:00, HH:mm:ss)|
+--------------------------------+
|1970-01-01 09:00:00             |
+--------------------------------+
only showing top 1 row
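For reference, a minimal snippet to reproduce that output, assuming pyspark.sql.functions is imported as func as in the expression above:

from pyspark.sql import functions as func

# The parsed literal gets the default date 1970-01-01 attached to it
df.select(func.to_timestamp(func.lit('09:00:00'), 'HH:mm:ss')).show(1, truncate=False)

This is exactly why your comparison always returned "No": the string "00:45:40" cannot be cast to a full timestamp, so the comparison evaluates to null and falls through to the otherwise branch.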
The best way to achieve this is either to split your work_time column into separate hour, minute, and second columns and filter on those, or to add a date to your work_time values before any timestamp filtering, as sketched below.
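A minimal sketch of the second approach, assuming work_time is a string as described; the fixed 1970-01-01 date is arbitrary and only serves to make the values comparable:

from pyspark.sql import functions as F

# Prepend an arbitrary fixed date so the time string parses as a full timestamp
df = df.withColumn(
    'work_ts',
    F.to_timestamp(F.concat(F.lit('1970-01-01 '), F.col('work_time')),
                   'yyyy-MM-dd HH:mm:ss')
)

# Compare against bounds carrying the same arbitrary date
df = df.withColumn(
    'day_shift',
    F.when(
        F.col('work_ts').between('1970-01-01 09:00:00', '1970-01-01 18:00:00'),
        'Yes'
    ).otherwise('No')
)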