I am using Spark dataframes. I have a use case where I need to increment a date by one. I the incremental date happens to be weekend then I need to increment it to next week/working day.
val df = Seq(
("50312", "2021-12-01", "0.9992019"),
("50312", "2021-12-02", "0.20171201"),
("50312", "2021-12-03", "2.9992019")
).toDF("id","some_date","item_value")
.withColumn("nextworking_day", date_add(col("some_date"),1))
Next working day should be next weekday not weekend. How to do it?
CodePudding user response:
You could use dayofweek
to get the number of the weekday and add 2 if the day is Saturday or add 3 if the day is Friday.
val day = dayofweek(col("some_date"))
val nextworkday = col("some_date") when(day > 5, -day 9).otherwise(1)
val df = Seq(
("50312", "2021-12-01", "0.9992019"),
("50312", "2021-12-02", "0.20171201"),
("50312", "2021-12-03", "2.9992019")
).toDF("id","some_date","item_value")
.withColumn("some_date", col("some_date").cast("date"))
.withColumn("nextworking_day", nextworkday)
df.show()
----- ---------- ---------- ---------------
| id| some_date|item_value|nextworking_day|
----- ---------- ---------- ---------------
|50312|2021-12-01| 0.9992019| 2021-12-02|
|50312|2021-12-02|0.20171201| 2021-12-03|
|50312|2021-12-03| 2.9992019| 2021-12-06|
----- ---------- ---------- ---------------
CodePudding user response:
Writing an udf for checking the day should solve the problem Below is the sample code running in pyspark, does not contain code for holidays but you can create a List or an enum and add a condition based on your region
import pyspark.sql.functions as f
from pyspark.sql.types import TimestampType
from datetime import datetime, timedelta
@f.udf(returnType=TimestampType())
def get_convert_date_udf(date_column):
datetime_object = datetime.strptime(date_column, "%Y-%m-%d")
new_datetime_object = datetime_object timedelta(days=1)
day = new_datetime_object.strftime("%A")
if day == "Sunday":
new_datetime_object = timedelta(days=1)
elif day == "Saturday":
new_datetime_object = timedelta(days=2)
return new_datetime_object
df = df.withColumn("next_working_date",
get_convert_date_udf(f.col("some_date")))