I'm trying to remove special characters from column names in a DataFrame created from a CSV. There are hundreds of columns, some with long names. I've tried a number of approaches, all returning an error on at least one of the columns.
import re
from pyspark.sql import functions as F

df = spark.read.format("com.databricks.spark.csv") \
    .option("mode", "DROPMALFORMED") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .load(getArgument('sourceCSVpath') + getArgument('sourceCSV'))
df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$] ","",col)) for col in df.columns])
temp_df1 = df.select([F.col(col).alias(col.replace('- ', '')) for col in df.columns])
ERROR (note the trailing newline inside the quoted column name):
cannot resolve 'Organization - No. Of Employees - Employee Figures Date
' given input columns:
Domestic Ultimate Employee Information Scope Code,NACE Revision 2 Description - Priority 4,NACE Revision 2 Description - Priority 5,NACE Revision 2 Description - Priority 6,Organization - No. Of Employees - Employee Figures Date,Number of Employees Scope Text,Organization Founded Date,NACE Revision 2 Description - Priority 1
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
CodePudding user response:
To remove dots (or any other unwanted characters) from the column names you can use DataFrame.toDF:
temp_df1 = df.toDF(*[re.sub(r"\.", "", col) for col in df.columns])
toDF only takes the order of the columns into account. The original column names, which can cause trouble in a select statement, are never referenced.
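Because toDF renames by position, the cleaning itself can be worked out on a plain Python list of names first. A minimal sketch (plain Python, using column names taken from the error message above; clean_name is a hypothetical helper, not part of the Spark API) that replaces every run of characters other than letters and digits, including spaces, dots, dashes, and the trailing newline that triggered the error, with a single underscore:

```python
import re

def clean_name(name: str) -> str:
    # Replace each run of characters that is not a letter or digit
    # (spaces, dots, dashes, newlines, ...) with one underscore,
    # then trim any leading/trailing underscores left over.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")

cols = [
    "Organization - No. Of Employees - Employee Figures Date\n",
    "NACE Revision 2 Description - Priority 4",
]
print([clean_name(c) for c in cols])
# ['Organization_No_Of_Employees_Employee_Figures_Date',
#  'NACE_Revision_2_Description_Priority_4']
```

The cleaned names can then be applied in one call, e.g. df.toDF(*[clean_name(c) for c in df.columns]), without ever having to resolve the original problematic names in a select.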