Need to convert this format to
name,code,DATE_invoice
Ram,E01,09/29/2018
Mara,E02,07/14/2017
Test,E03,01/01/18
this:
name,code,DATE_invoice
Ram,E01,2018-09-29
Mara,E02,2017-07-14
Test,E03,2018-01-01
CodePudding user response:
If the column is already a date, this should do the job:
df = df.withColumn('DATE_invoice', date_format(col("DATE_invoice"), "yyyy-MM-dd")))
CodePudding user response:
To Parse different data formats you can utilise to_date along with coalesce
You can utilise the same approach towards multiple patterns within your dataset , and example can be found here
Data preparation
input_str = """
Ram,E01,09/29/2018,
Mara,E02,07/14/2017,
Test,E03,01/01/18
""".split(",")
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "name,code,DATE_invoice".split(",")))
n = len(input_values)
n_col = 3
input_list = [tuple(input_values[i:i n_col]) for i in range(0,n,n_col)]
sparkDF = sql.createDataFrame(input_list, cols)
sparkDF.show()
---- ---- ------------
|name|code|DATE_invoice|
---- ---- ------------
| Ram| E01| 09/29/2018|
|Mara| E02| 07/14/2017|
|Test| E03| 01/01/18|
---- ---- ------------
To Date and Coalesce
sql.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
sparkDF.withColumn('p1',F.to_date(F.col('DATE_invoice'),"MM/dd/yyyy"))\
.withColumn('p2',F.to_date(F.col('DATE_invoice'),"MM/dd/yy"))\
.withColumn('DATE_invoice_parsed',F.coalesce(F.col('p1'),F.col('p2')))\
.drop(*['p1','p2'])\
.show(truncate=False)
---- ---- ------------ -------------------
|name|code|DATE_invoice|DATE_invoice_parsed|
---- ---- ------------ -------------------
|Ram |E01 |09/29/2018 |2018-09-29 |
|Mara|E02 |07/14/2017 |2017-07-14 |
|Test|E03 |01/01/18 |0018-01-01 |
---- ---- ------------ -------------------