I have a dataframe with a StringType column called ssn
which content is the following:
---------
| ssn|
---------
|986-69-2371|
|957305210|
|965741549|
|996118261|
|946899347|
|974393395|
|998771903|
|949429820|
|983662042|
|923785843|
|948-84-1328|
|971-48-1142|
|972-79-2371|
|989498321|
|943129824|
|976849729|
|949-79-5213|
|924761199|
|991718011|
|988-25-4619|
---------
As you can see the content is not homogeneous. My intention is to format the content so that all rows have the same format (the format I want is xxx-xx-xxxx
). All the rows have the 9 numbers.
I have done it with the following function:
df.withColumn("ssnFormat", when(col("ssn").contains("-"), col("ssn")).otherwise(format_string("%s-%s-%s", col("ssn").substr(1, 3), col("ssn").substr(4, 2), col("ssn").substr(6, 4))))
I was wondering, could I do it in a way that in the format expression I specified the characters that I want to take from the string?
This is what I mean:
format_string("<-,-L", col("ssn"))
In the previous instruction my intention was to select the first characters of the string, then write the hyphen, the 2 following characters, the hyphen and then the remaining characters.
I want to find if this is a possible way because I don't like much the fact that I have to do substring of the parts I want.
CodePudding user response:
You can use regexp_replace
function instead by capturing the groups xxx-xx-xxxx
and replacing by $1-$2-$3
:
val df1 = df.withColumn(
"ssn",
regexp_replace(col("ssn"), "^(\\d{3})-?(\\d{2})-?(\\d{4})$", "$1-$2-$3")
)
df1.show
// -----------
//| ssn|
// -----------
//|986-69-2371|
//|957-30-5210|
//|965-74-1549|
//|996-11-8261|
//|946-89-9347|
//|974-39-3395|
//|998-77-1903|
//|949-42-9820|
//|983-66-2042|
//|923-78-5843|
//|948-84-1328|
//|971-48-1142|
//|972-79-2371|
//|989-49-8321|
//|943-12-9824|
//|976-84-9729|
//|949-79-5213|
//|924-76-1199|
//|991-71-8011|
//|988-25-4619|
// -----------
CodePudding user response:
I think I know what you mean, similar to the Excel custom formats. One method is to use overlay
to insert the dashes at the right point. This seems similar to the SQL STUFF
function. I am still having to use regexp_replace
to make sure the strings are in a consistent format prior to the change so it's not that elegant:
df.select($"ssn",
overlay(
overlay(
regexp_replace($"ssn", "-", ""), lit("-"), lit(4), lit(0)), lit("-"), lit(7), lit(0)).
alias("formatted_ssn")).show
My results (from Synapse Notebook, Scala language):