I have data like this:
+---+------+
| id|   col|
+---+------+
|  1|210927|
|  2|210928|
|  3|210929|
|  4|210930|
|  5|211001|
+---+------+
I want the output like below:
+---+------+----------+
| id|   col|   t_date1|
+---+------+----------+
|  1|210927|27-09-2021|
|  2|210928|28-09-2021|
|  3|210929|29-09-2021|
|  4|210930|30-09-2021|
|  5|211001|01-10-2021|
+---+------+----------+
I was able to get this using pandas and strptime. Below is my code:
from datetime import datetime

pDF = df.toPandas()
valuesList = pDF['col'].to_list()
modifiedList = list()
for i in valuesList:
    # convert each yyMMdd string to dd-MM-yyyy
    modifiedList.append(datetime.strptime(i, "%y%m%d").strftime('%d-%m-%Y'))
pDF['t_date1'] = modifiedList
df = spark.createDataFrame(pDF)
Now, the main problem is that I want to avoid using pandas and Python lists, since I will be dealing with millions or even billions of rows, and pandas slows the process down considerably at that scale.
I tried various Spark functions like from_unixtime, to_date, and to_timestamp with the format I need, but no luck, and since strptime only works on strings I can't use it directly on a column. I am not willing to create a UDF either, since those are slow too.
The main difficulty is identifying the correct year (century), which I wasn't able to do in Spark, but I want to implement this using Spark only. What needs to be changed? Where am I going wrong?
CodePudding user response:
Did you use the correct format? Parsing with to_date and the pattern yyMMdd, then formatting with date_format and the pattern dd-MM-yyyy, should work:
import pyspark.sql.functions as f
df.withColumn('t_date', f.date_format(f.to_date('col', 'yyMMdd'), 'dd-MM-yyyy')).show()
+---+------+----------+
| id|   col|    t_date|
+---+------+----------+
|  1|210927|27-09-2021|
|  2|210928|28-09-2021|
|  3|210929|29-09-2021|
|  4|210930|30-09-2021|
|  5|211001|01-10-2021|
+---+------+----------+
If col is not of string type, cast it to string first:
df.withColumn('t_date', f.date_format(f.to_date(f.col('col').cast('string'), 'yyMMdd'), 'dd-MM-yyyy')).show()
CodePudding user response:
Here is another way:
import pandas as pd

# assumes df is a pandas DataFrame and col holds yyMMdd values
(df.assign(t_date1=pd.to_datetime('20' + df['col'].astype(str),
                                  format='%Y%m%d').dt.strftime('%d-%m-%Y')))
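Since the question asks for a Spark-only solution, here is a rough sketch of the same "prefix the century" idea in PySpark (my own translation, not part of the answer above; it assumes every value is a 20xx date and that col is, or can be cast to, a string):
import pyspark.sql.functions as F

# prepend '20', parse as yyyyMMdd, then format as dd-MM-yyyy
df.withColumn(
    't_date1',
    F.date_format(
        F.to_date(F.concat(F.lit('20'), F.col('col').cast('string')), 'yyyyMMdd'),
        'dd-MM-yyyy'
    )
).show()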
CodePudding user response:
According to the Python datetime.strptime source code:
# Open Group specification for strptime() states that a %y
# value in the range of [00, 68] is in the century 2000, while
# [69,99] is in the century 1900
if year <= 68:
    year += 2000
else:
    year += 1900
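A quick check of that cutoff with plain Python (my own illustration, not part of the quoted source):
from datetime import datetime

# %y maps 00-68 to the 2000s and 69-99 to the 1900s
print(datetime.strptime('680101', '%y%m%d').year)  # 2068
print(datetime.strptime('690101', '%y%m%d').year)  # 1969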
It's easy enough to implement this with PySpark's when and otherwise:
from pyspark.sql import functions as F

(df
    .withColumn('y', F.substring('col', 0, 2).cast('int'))
    .withColumn('y', F
        .when(F.col('y') <= 68, F.col('y') + 2000)
        .otherwise(F.col('y') + 1900)
    )
    .show()
)
# Output
# +---+------+----+
# | id|   col|   y|
# +---+------+----+
# |  1|210927|2021|
# |  2|910927|1991|
# +---+------+----+
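To go from the adjusted year to the requested dd-MM-yyyy column, one possible continuation (my own sketch, not part of the answer above; it assumes col is a string column) is to concatenate the corrected year with the MMdd part and parse the result as yyyyMMdd:
from pyspark.sql import functions as F

(df
    .withColumn('y', F.substring('col', 1, 2).cast('int'))
    .withColumn('y', F.when(F.col('y') <= 68, F.col('y') + 2000)
                      .otherwise(F.col('y') + 1900))
    # rebuild yyyyMMdd from the corrected year and the original MMdd part
    .withColumn('t_date1', F.date_format(
        F.to_date(F.concat(F.col('y').cast('string'), F.substring('col', 3, 4)), 'yyyyMMdd'),
        'dd-MM-yyyy'))
    .drop('y')
    .show()
)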
Technically, you can argue all day about this cutoff (00-68 vs. 69-99), but it is the de facto standard, so I don't see anything wrong with using it here.