I have an XML file stored in Azure Data Lake that I need to read from a Synapse notebook. But when I read it using the spark-xml library, I get this error:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `d:col`
Sample xml looks like this:
<m:properties>
<d:FileSystemObjectType m:type="Edm.Int32">0</d:FileSystemObjectType>
<d:Id m:type="Edm.Int32">10</d:Id>
<d:Modified m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Modified>
<d:Created m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Created>
<d:ID m:type="Edm.Int32">10</d:ID>
<d:Title m:null="true" />
<d:Description m:type="Edm.String">Test</d:Description>
<d:PurposeCode m:type="Edm.Int32">1</d:PurposeCode>
</m:properties>
Notice there are tags for d:Id and d:ID, which cause the duplicate error. I found this documentation stating that, although they differ in case, they are considered duplicates: https://docs.microsoft.com/en-us/azure/databricks/kb/sql/dupe-column-in-metadata. But I cannot modify the XML and have to read it as-is. Is there a workaround so I can still read the XML?
Alternatively, is there a way to read the XML without using Spark? I'm thinking of using the scala.xml.XML library to load and parse the file. But when I attempt this, I get an error:
abfss:/<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml (No such file or directory)
Code snippet below:
import scala.xml.XML
val xml = XML.loadFile("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")
Note: the error message shows only abfss:/ with a single slash, whereas the path passed as the parameter has //.
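(For reference: scala.xml.XML.loadFile opens the path with local java.io, which is why the abfss:// URI is not resolved and the scheme collapses to abfss:/. A sketch of a possible workaround, untested here, is to read the bytes through Hadoop's FileSystem API, which does understand abfss, and parse the resulting string; the placeholder path is the same as above:

import org.apache.hadoop.fs.Path
import scala.xml.XML

// Resolve the abfss path with the Hadoop FileSystem that Spark already carries.
val path = new Path("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")
val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Read the whole file into a string, then parse it with scala.xml.
val in = fs.open(path)
val content = try scala.io.Source.fromInputStream(in).mkString finally in.close()
val xml = XML.loadString(content))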
Thanks.
CodePudding user response:
Found a way to set Spark to be case-sensitive, which allows the XML to be read successfully:
spark.conf.set("spark.sql.caseSensitive", "true")
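With that setting in place, a minimal read might look like the following (a sketch, assuming the spark-xml package is attached to the pool; the rowTag value and the placeholder path are assumptions to adjust):

// Treat column names case-sensitively so d:Id and d:ID become
// two distinct columns instead of duplicates.
spark.conf.set("spark.sql.caseSensitive", "true")

// Read with spark-xml; rowTag and the path are placeholders.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "m:properties")
  .load("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")

df.printSchema()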