I have a little issue with my code that does not return any error.
The conversion of the date doesn't work.
The date remains in the same format "yearmonth" = "202012".
I want to get this date : "01/12/2020" (only month and year can changed).
I can convert from columns C or D but I use column D in my case.
I have an issue in "DATE" area in the code below.
Pyspark code https://drive.google.com/file/d/1Q2cQ8HS_hqSLYJvJ57IINJVDCm9vnJzk/view?usp=sharing
Excel file https://drive.google.com/file/d/1qbZ7yfc8AxUDImM2sPgGdfO1L3G8GRr8/view?usp=sharing
Target in image https://drive.google.com/file/d/1BqLHOzHJEgKNLA1plK2Elx6y5VYmpfnw/view?usp=sharing
Error in DATE area https://drive.google.com/file/d/16UjjRfUDmknYjK622Abt5_sJ8_8o4gGL/view?usp=sharing
Thank you for your help.
CodePudding user response:
I am going to assume that your "date" column is all "Month as Word" and "Full Year" (eg 'July 2020') and that you are using pandas because of your tag. First, you need to convert your column to a datetime object, then you can convert the format:
import pandas as pd
date = ["July 2020", "July 2022", "March 2020", "July 2021"]
df = pd.DataFrame({"date":date})
# Convert to datetime, then change the format of the date
df.date = pd.to_datetime(df.date, format='%B %Y').dt.strftime('%d/%m/%Y')
df
Output:
date
01/07/2020
01/07/2022
01/03/2020
01/07/2021
This does make the contents in the cell a string however. So if you want to use them as dates, then remove the dt.strftime('%d/%m/%Y')
part
CodePudding user response:
In pyspark, I've shown two ways to create a date field using the yearmonth
string field. You can then use the created date field to format the date to your choice using the date_format
function.
spark.sparkContext.parallelize([('202012',)]).toDF(['yearmonth']). \
withColumn('dt1',
func.concat_ws('-',
func.substring('yearmonth', 1, 4),
func.substring('yearmonth', 5, 2),
func.lit('01')
).cast('date')
). \
withColumn('dt2', func.to_date('yearmonth', 'yyyyMM').cast('date')). \
withColumn('dt_formatted', func.date_format('dt2', 'dd/MM/yyyy')). \
show()
# --------- ---------- ---------- ------------
# |yearmonth| dt1| dt2|dt_formatted|
# --------- ---------- ---------- ------------
# | 202012|2020-12-01|2020-12-01| 01/12/2020|
# --------- ---------- ---------- ------------
Both dt1
and dt2
fields are same but created differently.
For the 'Jul 2020'
format, use the to_date()
function.
spark.sparkContext.parallelize([('Dec 2020',)]).toDF(['yearmonth']). \
withColumn('dt2', func.to_date('yearmonth', 'MMM yyyy').cast('date')). \
withColumn('dt_formatted', func.date_format('dt2', 'dd/MM/yyyy')). \
show()
# --------- ---------- ------------
# |yearmonth| dt2|dt_formatted|
# --------- ---------- ------------
# | Dec 2020|2020-12-01| 01/12/2020|
# --------- ---------- ------------
For the 'July 2020'
format also, use the to_date()
function.
spark.sparkContext.parallelize([('December 2020',)]).toDF(['yearmonth']). \
withColumn('dt2', func.to_date('yearmonth', 'MMMM yyyy').cast('date')). \
withColumn('dt_formatted', func.date_format('dt2', 'dd/MM/yyyy')). \
show()
# ------------- ---------- ------------
# | yearmonth| dt2|dt_formatted|
# ------------- ---------- ------------
# |December 2020|2020-12-01| 01/12/2020|
# ------------- ---------- ------------
See spark compatible parsers for more.