I'm trying to remove special characters from column names in a DataFrame created from a CSV. There are hundreds of columns, some with long names. I've tried a number of approaches, all returning an error on at least one of the columns.
import re
from pyspark.sql import functions as F

df = spark.read.format("com.databricks.spark.csv") \
    .option("mode", "DROPMALFORMED") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .load(getArgument('sourceCSVpath') + getArgument('sourceCSV'))
df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$] ","",col)) for col in df.columns])
temp_df1 = df.select([F.col(col).alias(col.replace('- ', '')) for col in df.columns])
ERROR (note the trailing newline inside the quoted column name):
cannot resolve 'Organization - No. Of Employees - Employee Figures Date
' given input columns:
Domestic Ultimate Employee Information Scope Code,NACE Revision 2 Description - Priority 4,NACE Revision 2 Description - Priority 5,NACE Revision 2 Description - Priority 6,Organization - No. Of Employees - Employee Figures Date,Number of Employees Scope Text,Organization Founded Date,NACE Revision 2 Description - Priority 1
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
CodePudding user response:
To remove dots (or any other unwanted characters) from the column names you can use DataFrame.toDF:
temp_df1 = df.toDF(*[re.sub(r"\.", "", col) for col in df.columns])
toDF only takes the order of the columns into account. The original column names, which can cause trouble in a select statement, are never referenced.
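Because toDF renames by position, the cleaning itself can be worked out on a plain Python list of names first. A minimal sketch (plain Python, using column names taken from the error message above; clean_name is a hypothetical helper, not part of the Spark API) that replaces every run of characters other than letters and digits, including spaces, dots, dashes, and the trailing newline that triggered the error, with a single underscore:

```python
import re

def clean_name(name: str) -> str:
    # Replace each run of characters that is not a letter or digit
    # (spaces, dots, dashes, newlines, ...) with one underscore,
    # then trim any leading/trailing underscores left over.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")

cols = [
    "Organization - No. Of Employees - Employee Figures Date\n",
    "NACE Revision 2 Description - Priority 4",
]
print([clean_name(c) for c in cols])
# ['Organization_No_Of_Employees_Employee_Figures_Date',
#  'NACE_Revision_2_Description_Priority_4']
```

The cleaned names can then be applied in one call, e.g. df.toDF(*[clean_name(c) for c in df.columns]), without ever having to resolve the original problematic names in a select.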