I have a column in the "20130623" format, and I am trying to convert it into dd-mm-YYYY. I have seen various posts online, including here, but the only solution I found is the one below:
from datetime import datetime
df = df2.withColumn("col_name", datetime.utcfromtimestamp(int("col_name")).strftime('%d-%m-%y'))
However, it throws an error that the input should be int type, not string type. I tried to convert with the help of the int() function, but even that doesn't seem to help.
Below is the error I see when converting:
invalid literal for int() with base 10: 'col_name'
I am not sure if it is taking the name col_name as a string, or that column's value as a string. Please suggest how I can do this, or the best way to get the required output.
Note: I cannot use pandas in my environment.
Thank you.
CodePudding user response:
withColumn() expects a Spark column expression; you cannot use Python datetime functions there.
First parse the string to a date using to_date, then format the date back to a string using date_format:
import pyspark.sql.functions as F

df = spark.createDataFrame(data=[["20130623"]], schema=["date_str"])

# Parse the yyyyMMdd string into a proper DateType column.
df = df.withColumn("date", F.to_date("date_str", format="yyyyMMdd"))

# Format the date back to a string in the required dd-MM-yyyy layout.
df = df.withColumn("date_ddMMyyyy", F.date_format("date", format="dd-MM-yyyy"))

df.show(truncate=False)
df.printSchema()
Output:
+--------+----------+-------------+
|date_str|date      |date_ddMMyyyy|
+--------+----------+-------------+
|20130623|2013-06-23|23-06-2013   |
+--------+----------+-------------+
root
|-- date_str: string (nullable = true)
|-- date: date (nullable = true)
|-- date_ddMMyyyy: string (nullable = true)
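If the intermediate date column is not needed, the two calls can also be chained into a single expression; a minimal sketch using the same functions:
# One step: parse the yyyyMMdd string and immediately reformat it,
# without keeping the intermediate date column.
df = df.withColumn(
    "date_ddMMyyyy",
    F.date_format(F.to_date("date_str", "yyyyMMdd"), "dd-MM-yyyy"),
)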
Whatever you see as output (2013-06-23) in the date column is actually the default string representation of a date.
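For example, casting the date column to a string makes that default representation explicit (a small sketch; the column name date_as_str is just illustrative):
# Casting DateType to string yields the same yyyy-MM-dd text that show() displays.
df = df.withColumn("date_as_str", F.col("date").cast("string"))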