I have a pyspark dataframe:
Location  Month     New_Date  Sales
USA       1/1/2020  1/1/2020  34.56%
COL       1/1/2020  1/1/2020  66.4%
AUS       1/1/2020  1/1/2020  32.98%
NZ        null      null      44.59%
CHN       null      null      21.13%
I'm creating the New_Date column from the Month column (MM/dd/yyyy format). I need to populate New_Date values for the rows where Month is null.
And this is what I tried:

from pyspark.sql.functions import col, current_date, trunc

# keep only the rows where Month is null, then truncate today's date
# down to the first day of the *current* month
df1 = df.filter(col('Month').isNull()) \
    .withColumn("current_date", current_date()) \
    .withColumn("New_date", trunc(col("current_date"), "month"))
But I'm getting the first date of the current month, whereas I need the first date from the Month column. Please suggest another approach. Expected output:
Location  Month     New_Date  Sales
USA       1/1/2020  1/1/2020  34.56%
COL       1/1/2020  1/1/2020  66.4%
AUS       1/1/2020  1/1/2020  32.98%
NZ        null      1/1/2020  44.59%
CHN       null      1/1/2020  21.13%
CodePudding user response:
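For a reproducible test, the sample dataframe can be rebuilt like this (a minimal sketch; the Month, New_Date and Sales columns are assumed to be plain strings, since Month arrives as MM/dd/yyyy text):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question; null Month/New_Date become Python None
df = spark.createDataFrame(
    [("USA", "1/1/2020", "1/1/2020", "34.56%"),
     ("COL", "1/1/2020", "1/1/2020", "66.4%"),
     ("AUS", "1/1/2020", "1/1/2020", "32.98%"),
     ("NZ", None, None, "44.59%"),
     ("CHN", None, None, "21.13%")],
    ["Location", "Month", "New_Date", "Sales"],
)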
You can use the first function over a window:
from pyspark.sql import functions as F, Window

# window spanning the whole dataframe; with ignorenulls=True, first()
# returns the first non-null Month value across all rows
w = (Window.orderBy("Month")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df1 = df.withColumn(
    "New_date",
    F.coalesce(F.col("Month"), F.first("Month", ignorenulls=True).over(w))
)
df1.show()
# +--------+--------+--------+------+
# |Location|   Month|New_date| Sales|
# +--------+--------+--------+------+
# |      NZ|    null|1/1/2020|44.59%|
# |     CHN|    null|1/1/2020|21.13%|
# |     USA|1/1/2020|1/1/2020|34.56%|
# |     COL|1/1/2020|1/1/2020| 66.4%|
# |     AUS|1/1/2020|1/1/2020|32.98%|
# +--------+--------+--------+------+
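Note that Month is a string, so the window's orderBy compares it lexicographically. That is fine for this sample, but once several distinct months exist it could pick the wrong "first" date (e.g. "10/1/2020" sorts before "2/1/2020"). A more robust variant, sketched here under the assumption that Month always follows the M/d/yyyy pattern, parses the string with to_date and takes the minimum date instead:

from pyspark.sql import functions as F, Window

# frame-only window over the full dataframe (no ordering needed for min)
w_all = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df2 = df.withColumn(
    "New_date",
    F.coalesce(
        F.col("Month"),
        # earliest Month by actual date value, formatted back to M/d/yyyy
        F.date_format(
            F.min(F.to_date("Month", "M/d/yyyy")).over(w_all),
            "M/d/yyyy",
        ),
    ),
)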