I have a pyspark dataframe:
Location  Month     New_Date  Sales
USA       1/1/2020  1/1/2020  34.56%
COL       1/1/2020  1/1/2020  66.4%
AUS       1/1/2020  1/1/2020  32.98%
NZ        null      null      44.59%
CHN       null      null      21.13%
I'm creating the New_Date column from the Month column (MM/dd/yyyy format). I need to populate New_Date values for the rows where Month is null.
And this is what I tried:

from pyspark.sql.functions import col, current_date, trunc

# keep only the rows where Month is null, then truncate today's date
# down to the first day of the *current* month
df1 = df.filter(col('Month').isNull()) \
    .withColumn("current_date", current_date()) \
    .withColumn("New_date", trunc(col("current_date"), "month"))
But I'm getting the first date of the current month, whereas I need the first date from the Month column. Please suggest another approach. Expected output:
Location  Month     New_Date  Sales
USA       1/1/2020  1/1/2020  34.56%
COL       1/1/2020  1/1/2020  66.4%
AUS       1/1/2020  1/1/2020  32.98%
NZ        null      1/1/2020  44.59%
CHN       null      1/1/2020  21.13%
CodePudding user response:
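For a reproducible test, the sample dataframe can be rebuilt like this (a minimal sketch; the Month, New_Date and Sales columns are assumed to be plain strings, since Month arrives as MM/dd/yyyy text):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question; null Month/New_Date become Python None
df = spark.createDataFrame(
    [("USA", "1/1/2020", "1/1/2020", "34.56%"),
     ("COL", "1/1/2020", "1/1/2020", "66.4%"),
     ("AUS", "1/1/2020", "1/1/2020", "32.98%"),
     ("NZ", None, None, "44.59%"),
     ("CHN", None, None, "21.13%")],
    ["Location", "Month", "New_Date", "Sales"],
)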
You can use the first function over a window:
from pyspark.sql import functions as F, Window

# window spanning the whole dataframe; with ignorenulls=True, first()
# returns the first non-null Month value across all rows
w = (Window.orderBy("Month")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df1 = df.withColumn(
    "New_date",
    F.coalesce(F.col("Month"), F.first("Month", ignorenulls=True).over(w))
)
df1.show()
# +--------+--------+--------+------+
# |Location|   Month|New_date| Sales|
# +--------+--------+--------+------+
# |      NZ|    null|1/1/2020|44.59%|
# |     CHN|    null|1/1/2020|21.13%|
# |     USA|1/1/2020|1/1/2020|34.56%|
# |     COL|1/1/2020|1/1/2020| 66.4%|
# |     AUS|1/1/2020|1/1/2020|32.98%|
# +--------+--------+--------+------+
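Note that Month is a string, so the window's orderBy compares it lexicographically. That is fine for this sample, but once several distinct months exist it could pick the wrong "first" date (e.g. "10/1/2020" sorts before "2/1/2020"). A more robust variant, sketched here under the assumption that Month always follows the M/d/yyyy pattern, parses the string with to_date and takes the minimum date instead:

from pyspark.sql import functions as F, Window

# frame-only window over the full dataframe (no ordering needed for min)
w_all = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df2 = df.withColumn(
    "New_date",
    F.coalesce(
        F.col("Month"),
        # earliest Month by actual date value, formatted back to M/d/yyyy
        F.date_format(
            F.min(F.to_date("Month", "M/d/yyyy")).over(w_all),
            "M/d/yyyy",
        ),
    ),
)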