Can I select the number of characters in the function format

I have a dataframe with a StringType column called ssn which content is the following:

 --------- 
|      ssn|
 --------- 
|986-69-2371|
|957305210|
|965741549|
|996118261|
|946899347|
|974393395|
|998771903|
|949429820|
|983662042|
|923785843|
|948-84-1328|
|971-48-1142|
|972-79-2371|
|989498321|
|943129824|
|976849729|
|949-79-5213|
|924761199|
|991718011|
|988-25-4619|
 ---------

As you can see the content is not homogeneous. My intention is to format the content so that all rows have the same format (the format I want is xxx-xx-xxxx). All the rows have the 9 numbers. I have done it with the following function:

df.withColumn("ssnFormat", when(col("ssn").contains("-"), col("ssn")).otherwise(format_string("%s-%s-%s", col("ssn").substr(1, 3), col("ssn").substr(4, 2), col("ssn").substr(6, 4))))

I was wondering, could I do it in a way that in the format expression I specified the characters that I want to take from the string?

This is what I mean:

format_string("<-,-L", col("ssn"))

In the previous instruction my intention was to select the first characters of the string, then write the hyphen, the 2 following characters, the hyphen and then the remaining characters.

I want to find if this is a possible way because I don't like much the fact that I have to do substring of the parts I want.

CodePudding user response：

You can use regexp_replace function instead by capturing the groups xxx-xx-xxxx and replacing by $1-$2-$3:

val df1 = df.withColumn(
  "ssn",
  regexp_replace(col("ssn"), "^(\\d{3})-?(\\d{2})-?(\\d{4})$", "$1-$2-$3")
)

df1.show
// ----------- 
//|        ssn|
// ----------- 
//|986-69-2371|
//|957-30-5210|
//|965-74-1549|
//|996-11-8261|
//|946-89-9347|
//|974-39-3395|
//|998-77-1903|
//|949-42-9820|
//|983-66-2042|
//|923-78-5843|
//|948-84-1328|
//|971-48-1142|
//|972-79-2371|
//|989-49-8321|
//|943-12-9824|
//|976-84-9729|
//|949-79-5213|
//|924-76-1199|
//|991-71-8011|
//|988-25-4619|
// -----------

CodePudding user response：

I think I know what you mean, similar to the Excel custom formats. One method is to use overlay to insert the dashes at the right point. This seems similar to the SQL STUFF function. I am still having to use regexp_replace to make sure the strings are in a consistent format prior to the change so it's not that elegant:

df.select($"ssn", 
    overlay(
        overlay(
            regexp_replace($"ssn", "-", ""), lit("-"), lit(4), lit(0)), lit("-"), lit(7), lit(0)).
    alias("formatted_ssn")).show

My results (from Synapse Notebook, Scala language):