I have data like this:
+---+------+
| id|   col|
+---+------+
|  1|210927|
|  2|210928|
|  3|210929|
|  4|210930|
|  5|211001|
+---+------+
I want the output like below:
+---+------+----------+
| id|   col|   t_date1|
+---+------+----------+
|  1|210927|27-09-2021|
|  2|210928|28-09-2021|
|  3|210929|29-09-2021|
|  4|210930|30-09-2021|
|  5|211001|01-10-2021|
+---+------+----------+
I was able to get this using pandas and strptime. Below is my code:
from datetime import datetime

pDF = df.toPandas()
valuesList = pDF['col'].to_list()
modifiedList = list()
for i in valuesList:
    # convert each yyMMdd string to dd-MM-yyyy
    modifiedList.append(datetime.strptime(i, "%y%m%d").strftime('%d-%m-%Y'))
pDF['t_date1'] = modifiedList
df = spark.createDataFrame(pDF)
Now, the main problem is that I want to avoid using pandas and Python lists, since I will be dealing with millions or even billions of rows, and pandas slows the process down considerably at that scale.
I tried various Spark functions like from_unixtime, to_date, and to_timestamp with the format I need, but no luck, and since strptime only works on strings I can't use it directly on a column. I am not willing to create a UDF either, since those are slow too.
The main difficulty is identifying the correct year (century), which I wasn't able to do in Spark, but I want to implement this using Spark only. What needs to be changed? Where am I going wrong?
CodePudding user response:
Did you use the correct format? Parsing with to_date and the pattern yyMMdd, then formatting with date_format and the pattern dd-MM-yyyy, should work:
import pyspark.sql.functions as f
df.withColumn('t_date', f.date_format(f.to_date('col', 'yyMMdd'), 'dd-MM-yyyy')).show()
+---+------+----------+
| id|   col|    t_date|
+---+------+----------+
|  1|210927|27-09-2021|
|  2|210928|28-09-2021|
|  3|210929|29-09-2021|
|  4|210930|30-09-2021|
|  5|211001|01-10-2021|
+---+------+----------+
If col is not of string type, cast it to string first:
df.withColumn('t_date', f.date_format(f.to_date(f.col('col').cast('string'), 'yyMMdd'), 'dd-MM-yyyy')).show()
CodePudding user response:
Here is another way:
import pandas as pd

# assumes df is a pandas DataFrame and col holds yyMMdd values
(df.assign(t_date1=pd.to_datetime('20' + df['col'].astype(str),
                                  format='%Y%m%d').dt.strftime('%d-%m-%Y')))
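Since the question asks for a Spark-only solution, here is a rough sketch of the same "prefix the century" idea in PySpark (my own translation, not part of the answer above; it assumes every value is a 20xx date and that col is, or can be cast to, a string):
import pyspark.sql.functions as F

# prepend '20', parse as yyyyMMdd, then format as dd-MM-yyyy
df.withColumn(
    't_date1',
    F.date_format(
        F.to_date(F.concat(F.lit('20'), F.col('col').cast('string')), 'yyyyMMdd'),
        'dd-MM-yyyy'
    )
).show()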
CodePudding user response:
According to the Python datetime.strptime source code:
# Open Group specification for strptime() states that a %y
# value in the range of [00, 68] is in the century 2000, while
# [69,99] is in the century 1900
if year <= 68:
    year += 2000
else:
    year += 1900
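A quick check of that cutoff with plain Python (my own illustration, not part of the quoted source):
from datetime import datetime

# %y maps 00-68 to the 2000s and 69-99 to the 1900s
print(datetime.strptime('680101', '%y%m%d').year)  # 2068
print(datetime.strptime('690101', '%y%m%d').year)  # 1969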
It's easy enough to implement this with PySpark's when and otherwise:
from pyspark.sql import functions as F

(df
    .withColumn('y', F.substring('col', 0, 2).cast('int'))
    .withColumn('y', F
        .when(F.col('y') <= 68, F.col('y') + 2000)
        .otherwise(F.col('y') + 1900)
    )
    .show()
)
# Output
# +---+------+----+
# | id|   col|   y|
# +---+------+----+
# |  1|210927|2021|
# |  2|910927|1991|
# +---+------+----+
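To go from the adjusted year to the requested dd-MM-yyyy column, one possible continuation (my own sketch, not part of the answer above; it assumes col is a string column) is to concatenate the corrected year with the MMdd part and parse the result as yyyyMMdd:
from pyspark.sql import functions as F

(df
    .withColumn('y', F.substring('col', 1, 2).cast('int'))
    .withColumn('y', F.when(F.col('y') <= 68, F.col('y') + 2000)
                      .otherwise(F.col('y') + 1900))
    # rebuild yyyyMMdd from the corrected year and the original MMdd part
    .withColumn('t_date1', F.date_format(
        F.to_date(F.concat(F.col('y').cast('string'), F.substring('col', 3, 4)), 'yyyyMMdd'),
        'dd-MM-yyyy'))
    .drop('y')
    .show()
)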
Technically, you can argue all day about this cutoff (00-68 vs. 69-99), but it is the de facto standard, so I don't see anything wrong with using it here.