I have a data frame in PySpark like below:
df = spark.createDataFrame([['20220725']], ['report_date'])
df.show()
+-----------+
|report_date|
+-----------+
|   20220725|
+-----------+
Now I want to add one day to the above column value.
Expected result:
+----------+
|new_date_1|
+----------+
|  20220726|
+----------+
I have tried the following:
import pyspark.sql.functions as f
# Convert the above yyyyMMdd column to yyyy-MM-dd format
convert_df = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp(f.col("report_date"), 'yyyyMMdd'),'yyyy-MM-dd'))
convert_df.show()
+-----------+----------+
|report_date|  new_date|
+-----------+----------+
|   20220725|2022-07-25|
+-----------+----------+
# Add one day to the column and then convert back to yyyyMMdd format
posting_df = convert_df.withColumn("new_posting_date", f.date_add(f.col("new_date"), 1)).withColumn("new_date_1", f.from_unixtime(f.unix_timestamp(f.col("new_posting_date"), 'yyyy-MM-dd'), 'yyyyMMdd'))
posting_df.show()
+-----------+----------+----------------+----------+
|report_date|  new_date|new_posting_date|new_date_1|
+-----------+----------+----------------+----------+
|   20220725|2022-07-25|      2022-07-26|  20220726|
+-----------+----------+----------------+----------+
I am able to achieve what I want, but I would like to know if there is a better way.
CodePudding user response:
The main possible improvement I see is using to_date and date_format instead of unix_timestamp and from_unixtime.
Example code:
df.withColumn("new_date",
f.date_format(
f.date_add(f.to_date(df.report_date, "yyyyMMdd"), 1),
"yyyyMMdd"
)
)
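Parsing with to_date keeps the value as a real date, so there is no round trip through Unix epoch seconds and no need for the intermediate new_date and new_posting_date columns.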
Link to documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.to_date.html#pyspark.sql.functions.to_date
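For completeness, here is a minimal end-to-end sketch of this approach (it assumes an existing spark session, as in the question, and names the result column new_date_1 to match the expected output):
import pyspark.sql.functions as f

df = spark.createDataFrame([['20220725']], ['report_date'])

# Parse yyyyMMdd into a date, add one day, then format back to yyyyMMdd
result = df.withColumn(
    "new_date_1",
    f.date_format(
        f.date_add(f.to_date(f.col("report_date"), "yyyyMMdd"), 1),
        "yyyyMMdd"
    )
)
result.show()
+-----------+----------+
|report_date|new_date_1|
+-----------+----------+
|   20220725|  20220726|
+-----------+----------+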