I have such a pyspark DataFrames:
df1:
-------------------- ---------- ---------- ---------- ---------- ---------- ------ ----------------- --------
| NAME | X_NAME | BEGIN | END | A| B| C| D| E|
-------------------- ---------- ---------- ---------- ---------- ---------- ------ ----------------- --------
|whatever1 | XYZ|2021-09-27|2021-10-03| 0.0| 1.0| 0.0| 0.0| 0.0|
|whatever2 | XYZ|2021-09-27|2021-10-03| 0.0| 1.0| 0.0| 0.0| 0.0|
|whatever3 | XYZ|2021-10-04|2021-10-10| 0.0| 1.0| 0.0| 0.0| 0.0|
|whatever4 | XYZ|2021-10-04|2021-10-10| 0.0| 1.0| 0.0| 0.0| 0.0|
|whatever6 | XYZ|2021-10-18|2021-10-24| 0.0| 0.0| 1.0| 0.0| 0.0|
|whatever9 | XYZ|2021-10-25|2021-10-31| 0.0| 1.0| 0.0| 0.0| 0.0|
...
...
...
df2:
------------------- ----- ---- -------
| start_of_week|month|year|week_no|
------------------- ----- ---- -------
|2021-12-06 00:00:00| 12|2021|2021W49|
|2021-12-13 00:00:00| 12|2021|2021W50|
|2021-12-20 00:00:00| 12|2021|2021W51|
|2021-12-27 00:00:00| 12|2021|2021W52|
|2022-01-03 00:00:00| 1|2022| 2022W1|
|2022-01-10 00:00:00| 1|2022| 2022W2|
|2022-01-17 00:00:00| 1|2022| 2022W3|
|2022-01-24 00:00:00| 1|2022| 2022W4|
|2022-01-31 00:00:00| 2|2022| 2022W5|
|2022-02-07 00:00:00| 2|2022| 2022W6|
|2020-11-16 00:00:00| 11|2020|2020W47|
|2020-11-23 00:00:00| 11|2020|2020W48|
|2020-11-30 00:00:00| 12|2020|2020W49|
|2020-12-07 00:00:00| 12|2020|2020W50|
|2020-12-14 00:00:00| 12|2020|2020W51|
|2020-12-21 00:00:00| 12|2020|2020W52|
|2020-12-28 00:00:00| 12|2020|2020W53|
|2021-01-04 00:00:00| 1|2021| 2021W1|
|2021-01-11 00:00:00| 1|2021| 2021W2|
|2020-07-06 00:00:00| 7|2020|2020W28|
|2020-07-13 00:00:00| 7|2020|2020W29|
|2020-07-20 00:00:00| 7|2020|2020W30|
|2020-07-27 00:00:00| 7|2020|2020W31|
|2020-08-03 00:00:00| 8|2020|2020W32|
|2020-08-10 00:00:00| 8|2020|2020W33|
|2020-08-17 00:00:00| 8|2020|2020W34|
|2020-08-24 00:00:00| 8|2020|2020W35|
|2020-08-31 00:00:00| 9|2020|2020W36|
|2020-09-07 00:00:00| 9|2020|2020W37|
|2021-03-22 00:00:00| 3|2021|2021W12|
|2021-03-29 00:00:00| 4|2021|2021W13|
|2021-04-05 00:00:00| 4|2021|2021W14|
|2021-04-12 00:00:00| 4|2021|2021W15|
|2021-04-19 00:00:00| 4|2021|2021W16|
|2021-04-26 00:00:00| 4|2021|2021W17|
|2021-05-03 00:00:00| 5|2021|2021W18|
|2021-05-10 00:00:00| 5|2021|2021W19|
|2021-05-17 00:00:00| 5|2021|2021W20|
|2021-05-24 00:00:00| 5|2021|2021W21|
|2022-08-22 00:00:00| 8|2022|2022W34|
|2022-08-29 00:00:00| 9|2022|2022W35|
|2022-09-05 00:00:00| 9|2022|2022W36|
|2022-09-12 00:00:00| 9|2022|2022W37|
|2022-09-19 00:00:00| 9|2022|2022W38|
|2022-09-26 00:00:00| 9|2022|2022W39|
|2022-10-03 00:00:00| 10|2022|2022W40|
|2022-10-10 00:00:00| 10|2022|2022W41|
|2022-10-17 00:00:00| 10|2022|2022W42|
|2022-10-24 00:00:00| 10|2022|2022W43|
|2020-09-14 00:00:00| 9|2020|2020W38|
|2020-09-21 00:00:00| 9|2020|2020W39|
|2020-09-28 00:00:00| 10|2020|2020W40|
|2020-10-05 00:00:00| 10|2020|2020W41|
|2020-10-12 00:00:00| 10|2020|2020W42|
|2020-10-19 00:00:00| 10|2020|2020W43|
|2020-10-26 00:00:00| 10|2020|2020W44|
|2020-11-02 00:00:00| 11|2020|2020W45|
|2020-11-09 00:00:00| 11|2020|2020W46|
|2020-05-04 00:00:00| 5|2020|2020W19|
|2020-05-11 00:00:00| 5|2020|2020W20|
|2020-05-18 00:00:00| 5|2020|2020W21|
|2020-05-25 00:00:00| 5|2020|2020W22|
|2020-06-01 00:00:00| 6|2020|2020W23|
|2020-06-08 00:00:00| 6|2020|2020W24|
|2020-06-15 00:00:00| 6|2020|2020W25|
|2020-06-22 00:00:00| 6|2020|2020W26|
|2020-06-29 00:00:00| 7|2020|2020W27|
|2021-10-04 00:00:00| 10|2021|2021W40|
|2021-10-11 00:00:00| 10|2021|2021W41|
|2021-10-18 00:00:00| 10|2021|2021W42|
|2021-10-25 00:00:00| 10|2021|2021W43|
|2021-11-01 00:00:00| 11|2021|2021W44|
|2021-11-08 00:00:00| 11|2021|2021W45|
|2021-11-15 00:00:00| 11|2021|2021W46|
|2021-11-22 00:00:00| 11|2021|2021W47|
|2021-11-29 00:00:00| 12|2021|2021W48|
|2022-02-14 00:00:00| 2|2022| 2022W7|
|2022-02-21 00:00:00| 2|2022| 2022W8|
|2022-02-28 00:00:00| 3|2022| 2022W9|
|2022-03-07 00:00:00| 3|2022|2022W10|
|2022-03-14 00:00:00| 3|2022|2022W11|
|2022-03-21 00:00:00| 3|2022|2022W12|
|2022-03-28 00:00:00| 3|2022|2022W13|
|2022-04-04 00:00:00| 4|2022|2022W14|
|2022-04-11 00:00:00| 4|2022|2022W15|
|2022-04-18 00:00:00| 4|2022|2022W16|
|2022-04-25 00:00:00| 4|2022|2022W17|
|2022-05-02 00:00:00| 5|2022|2022W18|
|2022-05-09 00:00:00| 5|2022|2022W19|
|2022-05-16 00:00:00| 5|2022|2022W20|
|2022-05-23 00:00:00| 5|2022|2022W21|
|2022-05-30 00:00:00| 6|2022|2022W22|
|2022-06-06 00:00:00| 6|2022|2022W23|
|2022-06-13 00:00:00| 6|2022|2022W24|
|2022-06-20 00:00:00| 6|2022|2022W25|
|2022-06-27 00:00:00| 6|2022|2022W26|
|2022-07-04 00:00:00| 7|2022|2022W27|
|2022-07-11 00:00:00| 7|2022|2022W28|
|2022-07-18 00:00:00| 7|2022|2022W29|
|2022-07-25 00:00:00| 7|2022|2022W30|
|2022-08-01 00:00:00| 8|2022|2022W31|
|2022-08-08 00:00:00| 8|2022|2022W32|
|2022-08-15 00:00:00| 8|2022|2022W33|
|2021-01-18 00:00:00| 1|2021| 2021W3|
|2021-01-25 00:00:00| 1|2021| 2021W4|
|2021-02-01 00:00:00| 2|2021| 2021W5|
|2021-02-08 00:00:00| 2|2021| 2021W6|
|2021-02-15 00:00:00| 2|2021| 2021W7|
|2021-02-22 00:00:00| 2|2021| 2021W8|
|2021-03-01 00:00:00| 3|2021| 2021W9|
|2021-03-08 00:00:00| 3|2021|2021W10|
|2021-03-15 00:00:00| 3|2021|2021W11|
|2020-03-02 00:00:00| 3|2020|2020W10|
|2020-03-09 00:00:00| 3|2020|2020W11|
|2020-03-16 00:00:00| 3|2020|2020W12|
|2020-03-23 00:00:00| 3|2020|2020W13|
|2020-03-30 00:00:00| 4|2020|2020W14|
|2020-04-06 00:00:00| 4|2020|2020W15|
|2020-04-13 00:00:00| 4|2020|2020W16|
|2020-04-20 00:00:00| 4|2020|2020W17|
|2020-04-27 00:00:00| 4|2020|2020W18|
|2021-05-31 00:00:00| 6|2021|2021W22|
|2021-06-07 00:00:00| 6|2021|2021W23|
|2021-06-14 00:00:00| 6|2021|2021W24|
|2021-06-21 00:00:00| 6|2021|2021W25|
|2021-06-28 00:00:00| 7|2021|2021W26|
|2021-07-05 00:00:00| 7|2021|2021W27|
|2021-07-12 00:00:00| 7|2021|2021W28|
|2021-07-19 00:00:00| 7|2021|2021W29|
|2021-07-26 00:00:00| 7|2021|2021W30|
|2021-08-02 00:00:00| 8|2021|2021W31|
|2021-08-09 00:00:00| 8|2021|2021W32|
|2021-08-16 00:00:00| 8|2021|2021W33|
|2021-08-23 00:00:00| 8|2021|2021W34|
|2021-08-30 00:00:00| 9|2021|2021W35|
|2021-09-06 00:00:00| 9|2021|2021W36|
|2021-09-13 00:00:00| 9|2021|2021W37|
|2021-09-20 00:00:00| 9|2021|2021W38|
|2021-09-27 00:00:00| 9|2021|2021W39|
|2019-12-30 00:00:00| 1|2020| 2020W1|
|2020-01-06 00:00:00| 1|2020| 2020W2|
|2020-01-13 00:00:00| 1|2020| 2020W3|
|2020-01-20 00:00:00| 1|2020| 2020W4|
|2020-01-27 00:00:00| 1|2020| 2020W5|
|2020-02-03 00:00:00| 2|2020| 2020W6|
|2020-02-10 00:00:00| 2|2020| 2020W7|
|2020-02-17 00:00:00| 2|2020| 2020W8|
|2020-02-24 00:00:00| 2|2020| 2020W9|
------------------- ----- ---- -------
I would like to divide these BEGIN
and END
ranges to smaller units - week numbers from the second DataFrame. So final DataFrame would have only week_no
column instead of BEGIN
and END
. If the range is wider than one week, record would be multiplied to have more than one week number.
For example.:
-------------------- ---------- ---------- ---------- ---------- ---------- ------ ----------------- --------
| NAME | X_NAME | BEGIN | END | A| B| C| D| E|
-------------------- ---------- ---------- ---------- ---------- ---------- ------ ----------------- --------
|whatever345 | XYZ|2021-12-07|2021-12-14| 0.0| 1.0| 0.0| 0.0| 0.0|
Would be:
-------------------- ---------- ---------- ---------- ---------- ------ ----------------- --------
| NAME | X_NAME | week_no | A| B| C| D| E|
-------------------- ---------- ---------- ---------- ---------- ------ ----------------- --------
|whatever345 | XYZ| 2021W49| 0.0| 1.0| 0.0| 0.0| 0.0|
|whatever345 | XYZ| 2021W50| 0.0| 1.0| 0.0| 0.0| 0.0|
CodePudding user response:
you can use string format time from datetime module to fetch the weeknumber in the format you needed.
from datetime import date
#this will provide the format you want
date.strftime("%YW%W")
CodePudding user response:
Use the datetime functions to find the year and week number and create a series to fill week number and year in "begin" and "end" range. Then explode this series:
Full example:
df = spark.createDataFrame(data=[
["whatever1","XYZ","2021-09-27","2021-10-03",0.0,1.0,0.0,0.0,0.0],
["whatever2","XYZ","2021-09-27","2021-10-03",0.0,1.0,0.0,0.0,0.0],
["whatever3","XYZ","2021-10-04","2021-10-10",0.0,1.0,0.0,0.0,0.0],
["whatever4","XYZ","2021-10-04","2021-10-10",0.0,1.0,0.0,0.0,0.0],
["whatever6","XYZ","2021-10-18","2021-10-24",0.0,0.0,1.0,0.0,0.0],
["whatever9","XYZ","2021-10-25","2021-10-31",0.0,1.0,0.0,0.0,0.0],
["whatever100","XYZ","2021-12-20","2022-01-10",0.9,1.9,0.9,0.9,0.9],
], schema=["NAME","X_NAME","BEGIN","END","A","B","C","D","E"])
@F.udf(returnType=ArrayType(StringType()))
def week_range(begin, end):
from datetime import datetime
begin_dt = datetime.strptime(begin, "%Y-%m-%d").date()
end_dt = datetime.strptime(end, "%Y-%m-%d").date()
if begin_dt.year == end_dt.year:
return [f"{begin_dt.year}W{x}" for x in range(begin_dt.isocalendar()[1], end_dt.isocalendar()[1] 1)]
elif begin_dt.year < end_dt.year:
return [f"{begin_dt.year}W{x}" for x in range(begin_dt.isocalendar()[1], datetime.strptime(f"{begin_dt.year}-12-31", "%Y-%m-%d").date().isocalendar()[1] 1)] \
[f"{end_dt.year}W{x}" for x in range(1, end_dt.isocalendar()[1] 1)]
df = df.withColumn("week_no", week_range("BEGIN", "END"))
df = df.withColumn("week_no", F.explode("week_no"))
df = df.drop("BEGIN", "END")
Output:
----------- ------ --- --- --- --- --- -------
|NAME |X_NAME|A |B |C |D |E |week_no|
----------- ------ --- --- --- --- --- -------
|whatever1 |XYZ |0.0|1.0|0.0|0.0|0.0|2021W39|
|whatever2 |XYZ |0.0|1.0|0.0|0.0|0.0|2021W39|
|whatever3 |XYZ |0.0|1.0|0.0|0.0|0.0|2021W40|
|whatever4 |XYZ |0.0|1.0|0.0|0.0|0.0|2021W40|
|whatever6 |XYZ |0.0|0.0|1.0|0.0|0.0|2021W42|
|whatever9 |XYZ |0.0|1.0|0.0|0.0|0.0|2021W43|
|whatever100|XYZ |0.9|1.9|0.9|0.9|0.9|2021W51|
|whatever100|XYZ |0.9|1.9|0.9|0.9|0.9|2021W52|
|whatever100|XYZ |0.9|1.9|0.9|0.9|0.9|2022W1 |
|whatever100|XYZ |0.9|1.9|0.9|0.9|0.9|2022W2 |
----------- ------ --- --- --- --- --- -------
PS - In the last record, I have tested the corner case of week near year end and new year.