Home > Net >  AWS Glue studio converting Pyspark string column to date returns null
AWS Glue studio converting Pyspark string column to date returns null

Time:09-10

I have data from an S3 bucket and want to convert the Date column from string to date. The current Date column is in the format 7/1/2022 12:0:15 AM.

Current code I am using in AWS Glue Studio to attempt the custom transformation:

MyTransform (glueContext, dfc) -> DynamicFrameCollection:
      from pyspark.sql.functions import col, to_timestamp
      df = dfc.select(list(dfc.keys())[0]).toDF()
      df = df.withColumn('Date',to_timestamp(col("Date"), 'MM/dd/yyyy HH:MM:SS'))
      df_res = DynamicFrame.fromDF(df, glueContext, "df")
      return(DynamicFrameCollection({"CustomTransform0": df_res}, glueContext))

With MM/dd/yyyy HH:MM:SS date formatting, it runs but returns null for the Date column. When I try any other date format besides this, it errors out. I suspect the date formatting may be the issue, but I am not certain.

CodePudding user response:

After converting string to timestamp you need to cast it to date type, like this:

df = df.withColumn(df_col, df[df_col].cast("date"))

CodePudding user response:

We ended up removing the HH:MM:SS portion of the date format and this worked for our needs. I would still be interested if anyone can figure out how to get the hours, minutes, seconds, and AM/PM to work, but we can do without for now.

  • Related