Home > OS >  Week start and end dates within a month in Spark 2.4
Week start and end dates within a month in Spark 2.4

Time:08-03

Requirement: find out week start date and end date, but only within the given month date.

Input df:

df = spark.createDataFrame([('1', '2022-07-01'), ('2', '2022-07-30'), ('3', '2022-07-31')], ["id", "date"])
df.createOrReplaceTempView("date_table")

#  --- ---------- 
# | id|      date|
#  --- ---------- 
# | 1 |2022-07-01|
# | 2 |2022-07-30|
# | 3 |2022-07-31|
#  --- ---------- 

Expected output:

 --- ---------- ----------- ----------- 
| id|      date|week_start | week_end  |
 --- ---------- ----------- ----------- 
| 1 |2022-07-01|2022-07-01 |2022-07-02 |
| 2 |2022-07-30|2022-07-24 |2022-07-30 |
| 3 |2022-07-31|2022-07-31 |2022-07-31 |
 --- ---------- ----------- ----------- 

The below query does not work, as it goes out of the boundaries of the given month (row id 1 and 3).

spark.sql(""" select id,date,date_sub(date,dayofweek-1) as week_start,
                date_add(date, 7-dayofweek) as week_end FROM
                (select id,date,dayofweek(date) as dayofweek
                FROM date_table) T""")

CodePudding user response:

In order to get the expected output, together with what you already have, you could use greatest and least, providing the start and end of month respectively.

SELECT id,
       date,
       greatest(date_sub(date, dayofweek-1), trunc(date, 'MM')) as week_start,
       least(date_add(date, 7-dayofweek), last_day(date)) as week_end
FROM
    (SELECT id, date, dayofweek(date) as dayofweek
     FROM date_table) T

Full example:

df = spark.createDataFrame([('1', '2022-07-01'), ('2', '2022-07-30'), ('3', '2022-07-31')], ["id", "date"])
df.createOrReplaceTempView("date_table")

df = spark.sql(
    """
    SELECT id,
           date,
           greatest(date_sub(date, dayofweek-1), trunc(date, 'MM')) as week_start,
           least(date_add(date, 7-dayofweek), last_day(date)) as week_end
    FROM
        (SELECT id, date, dayofweek(date) as dayofweek
         FROM date_table) T
    """
)
df.show()
#  --- ---------- ---------- ---------- 
# | id|      date|week_start|  week_end|
#  --- ---------- ---------- ---------- 
# |  1|2022-07-01|2022-07-01|2022-07-02|
# |  2|2022-07-30|2022-07-24|2022-07-30|
# |  3|2022-07-31|2022-07-31|2022-07-31|
#  --- ---------- ---------- ---------- 
  • Related