I am using Glue version 3.0, Python version 3, and Spark version 3.1. I am extracting data from XML, creating a DataFrame, and writing the data to an S3 path in CSV format. Before writing the DataFrame I printed the schema and one record with show(1), and up to this point everything was fine. But while writing it to a CSV file in the S3 location I got the error "duplicate column found", because my DataFrame had two columns named "Title" and "title". I tried to add a new column title2 that would hold the content of title, planning to drop title later, with the command below:
from pyspark.sql import functions as f
df = df.withColumn('title2', f.expr("title"))
but I was getting the error "Reference 'title' is ambiguous, could be: title, title". I also tried
df = df.withColumn('title2', f.col("title"))
and got the same error. Any help or approach to resolve this, please?
CodePudding user response:
By default Spark is case-insensitive; we can make it case-sensitive by setting spark.sql.caseSensitive to True.
from pyspark.sql import functions as f

# Sample DataFrame with two columns that differ only in case
df = spark.createDataFrame([("CaptializedTitleColumn", "title_column")], ("Title", "title"))

# Enable case-sensitive column resolution
spark.conf.set('spark.sql.caseSensitive', True)

df.withColumn('title2', f.expr("title")).show()
Output
+--------------------+------------+------------+
|               Title|       title|      title2|
+--------------------+------------+------------+
|CaptializedTitleC...|title_column|title_column|
+--------------------+------------+------------+
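A minimal sketch of applying the same idea to the original write-to-S3 problem, assuming the goal is to copy the lowercase title column into title2, drop the duplicate, and then write the CSV; the bucket path and write options below are placeholders, not from the question:

from pyspark.sql import functions as f

# Enable case-sensitive resolution so "Title" and "title" are distinct columns
spark.conf.set('spark.sql.caseSensitive', True)

# Copy the lowercase column, then drop it so only "Title" and "title2" remain
df = df.withColumn('title2', f.col('title')).drop('title')

# Write to S3 as CSV; the path here is hypothetical
df.write.mode('overwrite').option('header', 'true').csv('s3://your-bucket/output-path/')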