I have a column in the "20130623" format, and I am trying to convert it into dd-mm-YYYY. I have seen various posts online, including here, but the only solution I found is the one below:
from datetime import datetime
df = df2.withColumn("col_name", datetime.utcfromtimestamp(int("col_name")).strftime('%d-%m-%y'))
However, it throws an error that the input should be int type, not string type. I tried to convert with the help of the int() function, but even that doesn't seem to help.
Below is the error I see when converting:
invalid literal for int() with base 10: 'col_name'
I am not sure if it is taking the name col_name as a string, or that column's value as a string. Please suggest how I can do this, or the best way to get the required output.
Note: I cannot use pandas in my environment.
Thank you.
CodePudding user response:
withColumn() expects a Spark column expression; you cannot use Python datetime functions there.
First parse the string to a date using to_date, then format the date back to a string using date_format:
import pyspark.sql.functions as F

df = spark.createDataFrame(data=[["20130623"]], schema=["date_str"])

# Parse the yyyyMMdd string into a proper DateType column.
df = df.withColumn("date", F.to_date("date_str", format="yyyyMMdd"))

# Format the date back to a string in the required dd-MM-yyyy layout.
df = df.withColumn("date_ddMMyyyy", F.date_format("date", format="dd-MM-yyyy"))

df.show(truncate=False)
df.printSchema()
Output:
+--------+----------+-------------+
|date_str|date      |date_ddMMyyyy|
+--------+----------+-------------+
|20130623|2013-06-23|23-06-2013   |
+--------+----------+-------------+
root
|-- date_str: string (nullable = true)
|-- date: date (nullable = true)
|-- date_ddMMyyyy: string (nullable = true)
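If the intermediate date column is not needed, the two calls can also be chained into a single expression; a minimal sketch using the same functions:
# One step: parse the yyyyMMdd string and immediately reformat it,
# without keeping the intermediate date column.
df = df.withColumn(
    "date_ddMMyyyy",
    F.date_format(F.to_date("date_str", "yyyyMMdd"), "dd-MM-yyyy"),
)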
Whatever you see as output (2013-06-23) in the date column is actually the default string representation of a date.
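For example, casting the date column to a string makes that default representation explicit (a small sketch; the column name date_as_str is just illustrative):
# Casting DateType to string yields the same yyyy-MM-dd text that show() displays.
df = df.withColumn("date_as_str", F.col("date").cast("string"))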