I'm trying to obtain the following:
+-----------+-----------+
| work_time | day_shift |
+-----------+-----------+
| 00:45:40  | No        |
| 10:05:47  | Yes       |
| 15:25:28  | Yes       |
| 19:38:52  | No        |
+-----------+-----------+
where I classify the "work_time" into "day_shift":
"Yes" - if the time falls between 09:00:00 and 18:00:00
"No" - otherwise
My "work_time" is in datetime format showing only the time. I tried the following, but I'm just getting "No" for everything.
df = df.withColumn(
    'day_shift',
    when(
        (df.work_time >= to_timestamp(lit('09:00:00'), 'HH:mm:ss'))
        & (df.work_time <= to_timestamp(lit('18:00:00'), 'HH:mm:ss')),
        'Yes'
    ).otherwise('No')
)
CodePudding user response:
You can use the Column class method between. It works for both timestamps and strings in the format "HH:mm:ss", since zero-padded time strings sort lexicographically in the same order as the times they represent. Use this:
F.col("work_time").between("09:00:00", "18:00:00")
Full test:
from pyspark.sql import functions as F

df = spark.createDataFrame([('00:45:40',), ('10:05:47',), ('15:25:28',), ('19:38:52',)], ['work_time'])

# between is inclusive on both ends: 09:00:00 and 18:00:00 both count as day shift
day_shift = F.col("work_time").between("09:00:00", "18:00:00")
df = df.withColumn("day_shift", F.when(day_shift, "Yes").otherwise("No"))
df.show()
# +---------+---------+
# |work_time|day_shift|
# +---------+---------+
# | 00:45:40|       No|
# | 10:05:47|      Yes|
# | 15:25:28|      Yes|
# | 19:38:52|       No|
# +---------+---------+
CodePudding user response:
First of all, Spark doesn't have a "Time" data type; it only supports TimestampType or DateType. Therefore, I believe the work_time in your dataframe is a string.
Secondly, when you check your func.to_timestamp(func.lit('09:00:00'), 'HH:mm:ss') in a select statement, it will show:
+--------------------------------+
|to_timestamp(09:00:00, HH:mm:ss)|
+--------------------------------+
|1970-01-01 09:00:00             |
+--------------------------------+
only showing top 1 row
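For reference, a minimal snippet to reproduce that output, assuming pyspark.sql.functions is imported as func as in the expression above:

from pyspark.sql import functions as func

# The parsed literal gets the default date 1970-01-01 attached to it
df.select(func.to_timestamp(func.lit('09:00:00'), 'HH:mm:ss')).show(1, truncate=False)

This is exactly why your comparison always returned "No": the string "00:45:40" cannot be cast to a full timestamp, so the comparison evaluates to null and falls through to the otherwise branch.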
The best way to achieve this is either to split your work_time column into separate hour, minute, and second columns and filter on those, or to add a date to your work_time values before any timestamp filtering, as sketched below.
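A minimal sketch of the second approach, assuming work_time is a string as described; the fixed 1970-01-01 date is arbitrary and only serves to make the values comparable:

from pyspark.sql import functions as F

# Prepend an arbitrary fixed date so the time string parses as a full timestamp
df = df.withColumn(
    'work_ts',
    F.to_timestamp(F.concat(F.lit('1970-01-01 '), F.col('work_time')),
                   'yyyy-MM-dd HH:mm:ss')
)

# Compare against bounds carrying the same arbitrary date
df = df.withColumn(
    'day_shift',
    F.when(
        F.col('work_ts').between('1970-01-01 09:00:00', '1970-01-01 18:00:00'),
        'Yes'
    ).otherwise('No')
)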