adding 1 day to yyyyMMdd date format in pyspark


I have a DataFrame in PySpark like below:

df = spark.createDataFrame([['20220725']], ['report_date'])

df.show()
+-----------+
|report_date|
+-----------+
|   20220725|
+-----------+

Now I want to add one day to the above column value.

Expected result:

+----------+
|new_date_1|
+----------+
|  20220726|
+----------+
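As a sanity check of the expected transformation (outside Spark), the same one-day add can be sketched in plain Python with datetime. This is only an illustration of the yyyyMMdd semantics, not the Spark solution; the helper name add_one_day is hypothetical:

```python
from datetime import datetime, timedelta

def add_one_day(yyyymmdd: str) -> str:
    # Java's yyyyMMdd pattern corresponds to %Y%m%d in Python's strptime/strftime
    d = datetime.strptime(yyyymmdd, "%Y%m%d")
    return (d + timedelta(days=1)).strftime("%Y%m%d")

print(add_one_day("20220725"))  # 20220726
print(add_one_day("20220731"))  # 20220801 (month rollover handled)
```

Note that this correctly rolls over month and year boundaries, which is what any string-based Spark solution must also get right.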

I have tried the following:

import pyspark.sql.functions as f

First, convert the yyyyMMdd column to yyyy-MM-dd format:

convert_df = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp(f.col("report_date"), 'yyyyMMdd'),'yyyy-MM-dd'))

convert_df.show()
+-----------+----------+
|report_date|  new_date|
+-----------+----------+
|   20220725|2022-07-25|
+-----------+----------+

Then add one day to the column and convert back to yyyyMMdd format:

posting_df = (
    convert_df
    .withColumn("new_posting_date", f.date_add(f.col("new_date"), 1))
    .withColumn("new_date_1", f.from_unixtime(f.unix_timestamp(f.col("new_posting_date"), 'yyyy-MM-dd'), 'yyyyMMdd'))
)

posting_df.show()
+-----------+----------+----------------+----------+
|report_date|  new_date|new_posting_date|new_date_1|
+-----------+----------+----------------+----------+
|   20220725|2022-07-25|      2022-07-26|  20220726|
+-----------+----------+----------------+----------+

I am able to achieve what I want, but I would like to know if there is a better way.

CodePudding user response:

The main improvement I see is using to_date and date_format instead of the unix_timestamp/from_unixtime round trip.

Example code:

df.withColumn("new_date",
  f.date_format(
    f.date_add(f.to_date(df.report_date, "yyyyMMdd"), 1),
    "yyyyMMdd"
  )
)

Link to documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.to_date.html#pyspark.sql.functions.to_date
