I have a dataframe like this,
scala> df.printSchema
root
|-- Protocol ID: decimal(12,0) (nullable = true)
|-- Protocol #: string (nullable = true)
|-- Eudract #: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- # Non-US Count: decimal(7,0) (nullable = true)
|-- # US Count: decimal(7,0) (nullable = true)
here the data columns have space and special characters in it. I wanted to replace that with underscore like this,
scala> newdf.printSchema
root
|-- Protocol_ID: decimal(12,0) (nullable = true)
|-- Protocol: string (nullable = true)
|-- Eudract: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- Non-US_Count: decimal(7,0) (nullable = true)
|-- US_Count: decimal(7,0) (nullable = true)
So I used the below steps,
val df=spark.read.format("parquet").load("<s3 path>")
val regex_string="""[ ._(),!#$%&"*./:;<-> ] """
val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
val resultDF = replacingColumns.zip(df.columns).foldLeft(df){
(tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
}
resultDF.printSchema
But Iam getting the df like this.
scala> resultDF.printSchema
root
|-- Protocol_ID: decimal(12,0) (nullable = true)
|-- Protocol_: string (nullable = true)
|-- Eudract_: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- _Non-US_Count: decimal(7,0) (nullable = true)
|-- _US_Count: decimal(7,0) (nullable = true)
If the space or special character is in the beginning or end then I dont want the underscore. In python I can use,
starts_with = [i.replace("_","",1) if i.startswith("_") else i for i in df.columns]
[(i[::-1].replace("_","",1)[::-1]) if i.endswith("_") else i for i in starts_with]
As I am new to scala I am not sure how to fix this. Any help would be appreciated.
CodePudding user response:
You can use (^_|_$)
regex to replace beginning or ending _
with empty string.
val regex_string = """[ ._(),!#$%&"*./:;<-> ] """
val col = regex_string.r.replaceAllIn("#Non-US Count##", "_")
println(col)
println("(^_|_$)".r.replaceAllIn(col, ""))
// _Non-US_Count_
// Non-US_Count
val replacingColumns = df.columns.map(s=>"(^_|_$)".r.replaceAllIn(regex_string.r.replaceAllIn(s, "_"),""))