Home > Enterprise >  PySpark remove special chars in all col names for all special chars - error cannot resolve given col
PySpark remove special chars in all col names for all special chars - error cannot resolve given col

Time:09-24

I'm trying to remove special characters from names in a dataframe created from a csv. There are 100's of columns some with long names. I've tried a number of approaches all returning an error on at least one of the columns?

df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "true") \
  .option("inferschema", "true") \
  .option("delimiter", ",").load(getArgument('sourceCSVpath')   getArgument('sourceCSV'))

df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$] ","",col)) for col in df.columns])

temp_df1 = df.select([F.col(col).alias(col.replace('- ', '')) for col in df.columns])

ERROR cannot resolve 'Organization - No. Of Employees - Employee Figures Date' given input columns

Domestic Ultimate Employee Information Scope Code,NACE Revision 2 Description - Priority 4,NACE Revision 2 Description - Priority 5,NACE Revision 2 Description - Priority 6,Organization - No. Of Employees - Employee Figures Date,Number of Employees Scope Text,Organization Founded Date,NACE Revision 2 Description - Priority 1
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities

CodePudding user response:

To remove dots (or any other unwanted characters) from the column names you can use DataFrame.toDF:

temp_df1 = df.toDF(*[re.sub("\.", "", col) for col in df.columns])

toDF only takes the order of the columns into account. The original names of the columns which could cause troubles in a select statement are not used.

  • Related