When I use week of year in the to_date function, the first week of the year comes back as null for several of the years.
from pyspark.sql import functions as func

df = pyrasa.sparkSession.createDataFrame([["2013-01"], ["2013-02"], ["2017-01"], ["2018-01"]], ["input"])
df.select(func.col("input"), func.to_date(func.col("input"), "yyyy-ww").alias("date")) \
    .show()
+-------+----------+
|  input|      date|
+-------+----------+
|2013-01|      null|
|2013-02|2013-01-06|
|2017-01|2017-01-01|
|2018-01|      null|
+-------+----------+
Why is this? Is it a bug in the to_date
function?
CodePudding user response:
Please try:

from pyspark.sql.functions import expr

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df.withColumn('date', expr("date(input)")).show()
+-------+----------+
|  input|      date|
+-------+----------+
|2013-01|2013-01-01|
|2013-02|2013-02-01|
|2017-01|2017-01-01|
|2018-01|2018-01-01|
+-------+----------+
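Note that this parses the input as year and month (so "2013-01" becomes January 1, 2013) rather than as year and week of year. If that interpretation is what you want, here is a minimal sketch of the same idea with an explicit year-month format, which on Spark 3 should not need the legacy flag at all:

from pyspark.sql import functions as F

# Parse as year-month; the missing day of month defaults to 1,
# so "2013-01" becomes 2013-01-01.
df.withColumn('date', F.to_date(F.col('input'), 'yyyy-MM')).show()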
CodePudding user response:
You could try the following pandas_udf. It works on the latest Spark version, but I can tell from your output that you are on something earlier than Spark 3. I use this function because in some cases there is no way to do this natively even in Spark 3; the move from Spark 2.4 to 3 changed quite a lot in datetime handling. Some tweaking of the format string may be required in your case, if the function works at all. The Python strftime/strptime documentation lists the available format codes.
from pyspark.sql import functions as F
import pandas as pd

@F.pandas_udf('date')
def year_week_to_date(year_week: pd.Series) -> pd.Series:
    # Append '-1' (ISO weekday 1 = Monday) and parse as ISO year-week-weekday.
    return pd.to_datetime(year_week + '-1', format='%G-%V-%u')

df = df.withColumn('date', year_week_to_date(F.col('input')))
df.show()
# +-------+----------+
# |  input|      date|
# +-------+----------+
# |2013-01|2012-12-31|
# |2013-02|2013-01-07|
# |2017-01|2017-01-02|
# |2018-01|2018-01-01|
# +-------+----------+
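The appended '-1' selects Monday (ISO weekday 1), so each value parses as the Monday of that ISO week. As a quick check of what the UDF computes, the same conversion can be run in plain pandas outside Spark (a minimal sketch reusing the format string from the UDF):

import pandas as pd

# ISO year (%G), ISO week (%V), ISO weekday (%u); '-1' pins each week to its Monday.
s = pd.Series(['2013-01', '2013-02', '2017-01', '2018-01'])
print(pd.to_datetime(s + '-1', format='%G-%V-%u'))
# Gives 2012-12-31, 2013-01-07, 2017-01-02 and 2018-01-01, matching the table above.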