Home > Enterprise >  Spark - Scala Remove special character from the beginning and end from columns in a dataframe
Spark - Scala Remove special character from the beginning and end from columns in a dataframe

Time:09-17

I have a dataframe like this,

scala> df.printSchema
root
 |-- Protocol ID: decimal(12,0) (nullable = true)
 |-- Protocol #: string (nullable = true)
 |-- Eudract #: string (nullable = true)
 |-- STDY_MIGRATED_INDC: string (nullable = true)
 |-- # Non-US Count: decimal(7,0) (nullable = true)
 |-- # US Count: decimal(7,0) (nullable = true)
 

here the data columns have space and special characters in it. I wanted to replace that with underscore like this,

scala> newdf.printSchema
root
 |-- Protocol_ID: decimal(12,0) (nullable = true)
 |-- Protocol: string (nullable = true)
 |-- Eudract: string (nullable = true)
 |-- STDY_MIGRATED_INDC: string (nullable = true)
 |-- Non-US_Count: decimal(7,0) (nullable = true)
 |-- US_Count: decimal(7,0) (nullable = true)

So I used the below steps,

val df=spark.read.format("parquet").load("<s3 path>")
val regex_string="""[ ._(),!#$%&"*./:;<-> ] """
val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
val resultDF = replacingColumns.zip(df.columns).foldLeft(df){
   (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
 }
resultDF.printSchema

But Iam getting the df like this.

scala> resultDF.printSchema
root
 |-- Protocol_ID: decimal(12,0) (nullable = true)
 |-- Protocol_: string (nullable = true)
 |-- Eudract_: string (nullable = true)
 |-- STDY_MIGRATED_INDC: string (nullable = true)
 |-- _Non-US_Count: decimal(7,0) (nullable = true)
 |-- _US_Count: decimal(7,0) (nullable = true)

If the space or special character is in the beginning or end then I dont want the underscore. In python I can use,

starts_with = [i.replace("_","",1) if i.startswith("_") else i for i in df.columns]
[(i[::-1].replace("_","",1)[::-1]) if i.endswith("_") else i for i in starts_with] 

As I am new to scala I am not sure how to fix this. Any help would be appreciated.

CodePudding user response:

You can use (^_|_$) regex to replace beginning or ending _ with empty string.

val regex_string = """[ ._(),!#$%&"*./:;<-> ] """
val col = regex_string.r.replaceAllIn("#Non-US Count##", "_")

println(col)
println("(^_|_$)".r.replaceAllIn(col, ""))

// _Non-US_Count_
// Non-US_Count

val replacingColumns = df.columns.map(s=>"(^_|_$)".r.replaceAllIn(regex_string.r.replaceAllIn(s, "_"),""))
  • Related