When I use week of year in the to_date function, the first week of the year comes back as null for several of the years.
from pyspark.sql import functions as func

df = pyrasa.sparkSession.createDataFrame([["2013-01"], ["2013-02"], ["2017-01"], ["2018-01"]], ["input"])
df.select(func.col("input"), func.to_date(func.col("input"), "yyyy-ww").alias("date")) \
    .show()
+-------+----------+
|  input|      date|
+-------+----------+
|2013-01|      null|
|2013-02|2013-01-06|
|2017-01|2017-01-01|
|2018-01|      null|
+-------+----------+
Why is this? Is it a bug in the to_date
function?
CodePudding user response:
Please try:

from pyspark.sql.functions import expr

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df.withColumn('date', expr("date(input)")).show()
+-------+----------+
|  input|      date|
+-------+----------+
|2013-01|2013-01-01|
|2013-02|2013-02-01|
|2017-01|2017-01-01|
|2018-01|2018-01-01|
+-------+----------+
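Note that this parses the input as year and month (so "2013-01" becomes January 1, 2013) rather than as year and week of year. If that interpretation is what you want, here is a minimal sketch of the same idea with an explicit year-month format, which on Spark 3 should not need the legacy flag at all:

from pyspark.sql import functions as F

# Parse as year-month; the missing day of month defaults to 1,
# so "2013-01" becomes 2013-01-01.
df.withColumn('date', F.to_date(F.col('input'), 'yyyy-MM')).show()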
CodePudding user response:
You could try the following pandas_udf. It works on the latest Spark version, but I can tell from your output that you are on something earlier than Spark 3. I use this function because in some cases there is no way to do this natively even in Spark 3; the move from Spark 2.4 to 3 changed quite a lot in datetime handling. Some tweaking of the format string may be required in your case, if the function works at all. The Python strftime/strptime documentation lists the available format codes.
from pyspark.sql import functions as F
import pandas as pd

@F.pandas_udf('date')
def year_week_to_date(year_week: pd.Series) -> pd.Series:
    # Append '-1' (ISO weekday 1 = Monday) and parse as ISO year-week-weekday.
    return pd.to_datetime(year_week + '-1', format='%G-%V-%u')

df = df.withColumn('date', year_week_to_date(F.col('input')))
df.show()
# +-------+----------+
# |  input|      date|
# +-------+----------+
# |2013-01|2012-12-31|
# |2013-02|2013-01-07|
# |2017-01|2017-01-02|
# |2018-01|2018-01-01|
# +-------+----------+
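The appended '-1' selects Monday (ISO weekday 1), so each value parses as the Monday of that ISO week. As a quick check of what the UDF computes, the same conversion can be run in plain pandas outside Spark (a minimal sketch reusing the format string from the UDF):

import pandas as pd

# ISO year (%G), ISO week (%V), ISO weekday (%u); '-1' pins each week to its Monday.
s = pd.Series(['2013-01', '2013-02', '2017-01', '2018-01'])
print(pd.to_datetime(s + '-1', format='%G-%V-%u'))
# Gives 2012-12-31, 2013-01-07, 2017-01-02 and 2018-01-01, matching the table above.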