Conditional forward fill values in Spark dataframe

I have a Spark dataframe where I need to create a window-partition column ("desired_outcome"). This conditional column should equal the "flag" column (0) until the first 1, and then carry that 1 forward through the rest of the partition ("user_id"). I've tried many different window partition variations (rowsBetween) but to no avail. Any help or guidance is greatly appreciated.

Here are some use cases and their "desired output":

columns = ['user_id', 'date', 'flag', 'desired_outcome']
data = [
        ('1','2022-01-01', 0, 0),
        ('1','2022-01-02', 0, 0),
        ('1','2022-01-03', 0, 0),
        ('1','2022-01-04', 0, 0),
        ('1','2022-01-05', 1, 1),
        ('1','2022-01-06', 0, 1),
        ('1','2022-01-07', 0, 1),
        ('2','2022-01-01', 0, 0),
        ('2','2022-01-02', 0, 0),
        ('2','2022-01-03', 0, 0),
        ('2','2022-01-04', 0, 0),
        ('2','2022-01-05', 0, 0),
        ('2','2022-01-06', 1, 1),
        ('2','2022-01-07', 0, 1),
        ('2','2022-01-08', 0, 1),
        ('2','2022-01-09', 0, 1),
        ('2','2022-01-10', 0, 1),
        ('2','2022-01-11', 0, 1),
        ('2','2022-01-12', 0, 1)]

sample_df = spark.createDataFrame(data, columns)

CodePudding user response:

We can solve this particular case (as there are just two flag values, 0 and 1) with a running max() over a window that spans from the start of the partition to the current row: once a 1 appears, the running max stays 1 for every subsequent row.

from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

sample_df. \
    withColumn('so_answer',
               func.max('flag').over(
                   wd.partitionBy('user_id')
                     .orderBy('date')
                     .rowsBetween(wd.unboundedPreceding, wd.currentRow))
               ). \
    show()

# +-------+----------+----+---------------+---------+
# |user_id|      date|flag|desired_outcome|so_answer|
# +-------+----------+----+---------------+---------+
# |      1|2022-01-01|   0|              0|        0|
# |      1|2022-01-02|   0|              0|        0|
# |      1|2022-01-03|   0|              0|        0|
# |      1|2022-01-04|   0|              0|        0|
# |      1|2022-01-05|   1|              1|        1|
# |      1|2022-01-06|   0|              1|        1|
# |      1|2022-01-07|   0|              1|        1|
# |      2|2022-01-01|   0|              0|        0|
# |      2|2022-01-02|   0|              0|        0|
# |      2|2022-01-03|   0|              0|        0|
# |      2|2022-01-04|   0|              0|        0|
# |      2|2022-01-05|   0|              0|        0|
# |      2|2022-01-06|   1|              1|        1|
# |      2|2022-01-07|   0|              1|        1|
# |      2|2022-01-08|   0|              1|        1|
# |      2|2022-01-09|   0|              1|        1|
# |      2|2022-01-10|   0|              1|        1|
# |      2|2022-01-11|   0|              1|        1|
# |      2|2022-01-12|   0|              1|        1|
# +-------+----------+----+---------------+---------+