I have a PySpark dataframe df containing dates in string format in a column DTC, like this:
DTC
11 AUG 2012 10:12
AUG 2012 10:20
13 AUG 2012 10:22
I want to replace the spaces inside the date part (but not the one before the time) with hyphens for all dates in the column, like this:
DTC
11-AUG-2012 10:12
AUG-2012 10:20
13-AUG-2012 10:22
Any suggestions? Please note that there are some partial dates in the column as well, so I can't convert it to a date data type: that would turn them into null and I would lose the data. I want to preserve the partial dates as well.
CodePudding user response:
You could parse the string with to_timestamp using the format "dd MMM yyyy HH:mm" and then render it back with date_format using your desired format "dd-MMM-yyyy HH:mm", like this:
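from pyspark.sql import functions as f
new_df = (
    df
    # Parse the string; rows that do not match the format (e.g. partial dates) typically become null
    .withColumn("date", f.to_timestamp(f.col("DTC"), "dd MMM yyyy HH:mm"))
    # Format back with hyphens; upper() because MMM renders as "Aug" rather than "AUG"
    .withColumn("DTC", f.upper(f.date_format(f.col("date"), "dd-MMM-yyyy HH:mm")))
    .drop("date")
)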
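Another, probably less generic, approach is to edit the string directly. One way is to use split with a limit of 3 fields and then concat_ws. This only handles complete dates, though: a partial date such as "AUG 2012 10:20" has just two spaces, so the one before the time would also become a hyphen. (The limit argument of split requires Spark 3.0 or later.)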
new_df = df.withColumn("DTC", f.concat_ws("-", f.split("DTC", " ", 3)))
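To see why the split-based variant only suits complete dates, here is a plain-Python sketch of the same string logic, outside Spark. It assumes that Python's str.split(" ", 2) (at most two splits, so at most three fields) mirrors split with a limit of 3; the helper name split_and_join is made up for illustration.

```python
def split_and_join(value: str) -> str:
    """Split on the first two spaces (at most 3 fields) and join with hyphens."""
    # maxsplit=2 yields at most 3 fields, like a split limit of 3
    return "-".join(value.split(" ", 2))


print(split_and_join("11 AUG 2012 10:12"))  # 11-AUG-2012 10:12
print(split_and_join("AUG 2012 10:20"))     # AUG-2012-10:20 (time separator lost)
```

The second line shows the problem for partial dates: the space before the time is consumed as well.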
CodePudding user response:
In the case of "partial" dates, as mentioned in the comments of the other answer, to_timestamp would set them to null. In that case, I would use a regex instead. In the code below, I extract everything up to the last space (the date column), then the last space and everything after it (the time column). Finally, I concatenate them after replacing the spaces in the date with hyphens. Note that I trim the date to get rid of its trailing space, while the time keeps its leading space, which serves as the separator.
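from pyspark.sql import functions as f
# Assumes an active SparkSession named spark (as in the pyspark shell)
df = spark.createDataFrame([
    (1, '11 AUG 2012 10:12'),
    (2, 'AUG 2012 10:20'),
    (3, '2012 11:11')
], ['id', 'DTC'])
(
    df
    # Everything up to and including the last space
    .withColumn("date", f.regexp_extract("DTC", "^.* ", 0))
    # The last space and everything after it
    .withColumn("time", f.regexp_extract("DTC", " [^ ]*$", 0))
    # Hyphenate the trimmed date, then append the time (which keeps its leading space)
    .withColumn("DTC", f.concat(f.regexp_replace(f.trim("date"), " ", "-"), "time"))
    .show()
)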
which yields:
+---+-----------------+-----------+------+
| id|              DTC|       date|  time|
+---+-----------------+-----------+------+
|  1|11-AUG-2012 10:12|11 AUG 2012| 10:12|
|  2|   AUG-2012 10:20|   AUG 2012| 10:20|
|  3|       2012 11:11|       2012| 11:11|
+---+-----------------+-----------+------+
You may then drop the date and time columns.
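As a possible shortcut, the same result can be sketched with a single regex that replaces every space that still has another space somewhere after it, which leaves the last space (the one before the time) alone. The snippet below checks the pattern in plain Python; the function name hyphenate_date is hypothetical, and reusing the pattern in Spark as f.regexp_replace("DTC", " (?=.* )", "-") is an assumption that relies on lookahead support in Spark's Java regex engine, so try it on your own data first.

```python
import re

# A space followed (somewhere later) by another space, i.e. any space but the last
NOT_LAST_SPACE = re.compile(r" (?=.* )")


def hyphenate_date(value: str) -> str:
    """Replace every space except the last one with a hyphen."""
    return NOT_LAST_SPACE.sub("-", value)


for raw in ["11 AUG 2012 10:12", "AUG 2012 10:20", "2012 11:11"]:
    print(hyphenate_date(raw))
# 11-AUG-2012 10:12
# AUG-2012 10:20
# 2012 11:11
```

With this variant there are no helper columns to drop afterwards.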