How to replace only first two occurrences of a character from a string in pyspark dataframe column?

Time:12-16

I have a PySpark dataframe df with dates stored as strings in a column DTC, like this:

DTC
11 AUG 2012 10:12
AUG 2012 10:20
13 AUG 2012 10:22

I want to replace the first two spaces with hyphens for all dates in the column, like this:

DTC
11-AUG-2012 10:12
AUG-2012 10:20
13-AUG-2012 10:22

Any suggestions? Please note that there are some partial dates in the column as well, so I can't convert it to a date data type: that would turn them into null and I would lose the data. I want to preserve the partial dates as well.

CodePudding user response:

You could parse the date with to_timestamp using the format "dd MMM yyyy HH:mm" and then format it with your desired format "dd-MMM-yyyy HH:mm", like this:

from pyspark.sql import functions as f

new_df = df\
    .withColumn("date", f.to_timestamp(f.col("DTC"), "dd MMM yyyy HH:mm"))\
    .withColumn("DTC", f.date_format(f.col("date"), "dd-MMM-yyyy HH:mm"))\
    .drop("date")

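Another, probably less generic, approach would be to edit the string directly. One way to do it is to use split with a limit of 3 fields (the limit argument requires Spark 3.0 or later) and concat_ws. Note that this only gives the desired result for full dates: a partial date such as AUG 2012 10:20 contains only two spaces, so both are replaced, including the one before the time.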

new_df = df.withColumn("DTC", f.concat_ws("-", f.split("DTC", " ", 3)))
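
To see how the split approach treats full and partial dates, here is a small plain-Python sketch of the same idea. It assumes that Python's str.split with maxsplit=2 is a fair stand-in for Spark's split with a limit of 3 when the separator is a single space; the helper name is made up for illustration.

```python
def join_first_three_fields(value: str) -> str:
    """Mimic concat_ws("-", split(value, " ", 3)) in plain Python (illustrative helper)."""
    # maxsplit=2 yields at most 3 fields, like a Spark split limit of 3
    return "-".join(value.split(" ", 2))


full = join_first_three_fields("11 AUG 2012 10:12")
partial = join_first_three_fields("AUG 2012 10:20")

print(full)     # 11-AUG-2012 10:12
print(partial)  # AUG-2012-10:20  (the space before the time is lost)
```

The full date comes out as wanted, while the partial date ends up with a hyphen before the time.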

CodePudding user response:

In the case of "partial" dates, as mentioned in the comments on the other answer, to_timestamp would set them to null. In that case, I would use a regex. For instance, in the code below, I extract everything up to the last space (date column). Then I extract everything from the last space onward (time column), which keeps the leading space. Finally, I concatenate the two after replacing spaces with hyphens in the date. Note that I trim the date to get rid of the trailing space.

from pyspark.sql import functions as f

df = spark.createDataFrame([
    (1, '11 AUG 2012 10:12'),
    (2, 'AUG 2012 10:20'),
    (3, '2012 11:11')
], ['id', 'DTC'])

df\
    .withColumn("date", f.regexp_extract("DTC", "^.* ", 0))\
    .withColumn("time", f.regexp_extract("DTC", " [^ ]*$", 0))\
    .withColumn("DTC", f.concat(f.regexp_replace(f.trim("date"), " ", "-"), "time"))\
    .show()

which yields:

+---+-----------------+-----------+------+
| id|              DTC|       date|  time|
+---+-----------------+-----------+------+
|  1|11-AUG-2012 10:12|11 AUG 2012| 10:12|
|  2|   AUG-2012 10:20|   AUG 2012| 10:20|
|  3|       2012 11:11|       2012| 11:11|
+---+-----------------+-----------+------+

You may then drop the date and time columns.
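
As a possible shortcut, the same result might be reached with a single regexp_replace that turns every space except the last one into a hyphen, using a lookahead. The sketch below is an alternative under stated assumptions, not the method of this answer: the pattern and both helper names are made up here, and it assumes every value ends with a time part preceded by exactly one space. The plain-Python function is only there to check the pattern without a Spark session.

```python
import re

# A space that still has another space somewhere after it,
# i.e. every space except the last one (hypothetical pattern).
ALL_BUT_LAST_SPACE = r" (?=.* )"


def hyphenate(value: str) -> str:
    """Plain-Python check of the pattern (illustrative helper)."""
    return re.sub(ALL_BUT_LAST_SPACE, "-", value)


def hyphenate_column(df, column="DTC"):
    """Apply the same pattern to a string column of a Spark dataframe."""
    # Imported here so the pattern check above runs without Spark installed.
    from pyspark.sql import functions as f

    return df.withColumn(column, f.regexp_replace(column, ALL_BUT_LAST_SPACE, "-"))


for sample in ["11 AUG 2012 10:12", "AUG 2012 10:20", "2012 11:11"]:
    print(hyphenate(sample))
```

With the sample dataframe above, hyphenate_column(df) should need no helper columns, so there is nothing to drop afterwards.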
