Home > OS >  Pyspark can not convert date from "July 2020" to "01/07/2020
Pyspark can not convert date from "July 2020" to "01/07/2020

Time:07-20

I have a little issue with my code that does not return any error.

The conversion of the date doesn't work.

The date remains in the same format "yearmonth" = "202012".

I want to get this date : "01/12/2020" (only month and year can changed).

I can convert from columns C or D but I use column D in my case.

I have an issue in "DATE" area in the code below.

Pyspark code https://drive.google.com/file/d/1Q2cQ8HS_hqSLYJvJ57IINJVDCm9vnJzk/view?usp=sharing

Excel file https://drive.google.com/file/d/1qbZ7yfc8AxUDImM2sPgGdfO1L3G8GRr8/view?usp=sharing

Target in image https://drive.google.com/file/d/1BqLHOzHJEgKNLA1plK2Elx6y5VYmpfnw/view?usp=sharing

Error in DATE area https://drive.google.com/file/d/16UjjRfUDmknYjK622Abt5_sJ8_8o4gGL/view?usp=sharing

Thank you for your help.

CodePudding user response:

I am going to assume that your "date" column is all "Month as Word" and "Full Year" (eg 'July 2020') and that you are using pandas because of your tag. First, you need to convert your column to a datetime object, then you can convert the format:

import pandas as pd
date = ["July 2020", "July 2022", "March 2020", "July 2021"]
df = pd.DataFrame({"date":date})

# Convert to datetime, then change the format of the date
df.date = pd.to_datetime(df.date, format='%B %Y').dt.strftime('%d/%m/%Y')
df

Output:

date
01/07/2020
01/07/2022
01/03/2020
01/07/2021

This does make the contents in the cell a string however. So if you want to use them as dates, then remove the dt.strftime('%d/%m/%Y') part

CodePudding user response:

In pyspark, I've shown two ways to create a date field using the yearmonth string field. You can then use the created date field to format the date to your choice using the date_format function.

spark.sparkContext.parallelize([('202012',)]).toDF(['yearmonth']). \
    withColumn('dt1', 
               func.concat_ws('-', 
                              func.substring('yearmonth', 1, 4), 
                              func.substring('yearmonth', 5, 2), 
                              func.lit('01')
                              ).cast('date')
               ). \
    withColumn('dt2', func.to_date('yearmonth', 'yyyyMM').cast('date')). \
    withColumn('dt_formatted', func.date_format('dt2', 'dd/MM/yyyy')). \
    show()

#  --------- ---------- ---------- ------------ 
# |yearmonth|       dt1|       dt2|dt_formatted|
#  --------- ---------- ---------- ------------ 
# |   202012|2020-12-01|2020-12-01|  01/12/2020|
#  --------- ---------- ---------- ------------ 

Both dt1 and dt2 fields are same but created differently.

For the 'Jul 2020' format, use the to_date() function.

spark.sparkContext.parallelize([('Dec 2020',)]).toDF(['yearmonth']). \
    withColumn('dt2', func.to_date('yearmonth', 'MMM yyyy').cast('date')). \
    withColumn('dt_formatted', func.date_format('dt2', 'dd/MM/yyyy')). \
    show()

#  --------- ---------- ------------ 
# |yearmonth|       dt2|dt_formatted|
#  --------- ---------- ------------ 
# | Dec 2020|2020-12-01|  01/12/2020|
#  --------- ---------- ------------ 

For the 'July 2020' format also, use the to_date() function.

spark.sparkContext.parallelize([('December 2020',)]).toDF(['yearmonth']). \
    withColumn('dt2', func.to_date('yearmonth', 'MMMM yyyy').cast('date')). \
    withColumn('dt_formatted', func.date_format('dt2', 'dd/MM/yyyy')). \
    show()

#  ------------- ---------- ------------ 
# |    yearmonth|       dt2|dt_formatted|
#  ------------- ---------- ------------ 
# |December 2020|2020-12-01|  01/12/2020|
#  ------------- ---------- ------------ 

See spark compatible parsers for more.

  • Related