How To Read XML File from Azure Data Lake In Synapse Notebook without Using Spark


I have an XML file stored in Azure Data Lake which I need to read from a Synapse notebook. But when I read it using the spark-xml library, I get this error:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `d:col`

A sample of the XML looks like this:

<m:properties>
    <d:FileSystemObjectType m:type="Edm.Int32">0</d:FileSystemObjectType>
    <d:Id m:type="Edm.Int32">10</d:Id>
    <d:Modified m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Modified>
    <d:Created m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Created>
    <d:ID m:type="Edm.Int32">10</d:ID>
    <d:Title m:null="true" />
    <d:Description m:type="Edm.String">Test</d:Description>
    <d:PurposeCode m:type="Edm.Int32">1</d:PurposeCode>
</m:properties>

Notice there are tags for d:Id and d:ID, which are causing the duplicate error. I found this documentation stating that even though they differ in case, they are considered duplicates: https://docs.microsoft.com/en-us/azure/databricks/kb/sql/dupe-column-in-metadata But I cannot modify the XML and have to read it as it is. Is there a workaround so I can still read the XML?
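
For context, the read is done with spark-xml along these lines (a rough sketch; the rowTag and options below are placeholders rather than the exact ones used):

// Rough sketch of the spark-xml read; rowTag and path are placeholders.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "m:properties")
  .load("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")
// Schema inference picks up both d:Id and d:ID, which triggers the duplicate column error.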

Or, is there a way to read the XML without using Spark? I'm thinking of using the scala.xml.XML library to load and parse the file. But when I attempt this, I get an error:

abfss:/<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml (No such file or directory)

Code snippet below:

import scala.xml.XML
val xml = XML.loadFile("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")

Note: the error only displays abfss:/ with a single slash, as opposed to the path passed as the parameter, which has //.
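
One thing I have not tried yet: scala.xml.XML.loadFile expects a local file path, so the abfss URI is not resolved. A possible alternative (an untested sketch, which still uses the Spark session's Hadoop configuration for storage access but avoids the spark-xml DataFrame read, and assumes the storage account is already accessible from the notebook) is to open the file through the Hadoop FileSystem API and parse the stream:

import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import scala.xml.XML

// Untested sketch: open the ADLS file via the Hadoop FileSystem configured for the pool.
val path = "abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml"
val fs = FileSystem.get(new URI(path), spark.sparkContext.hadoopConfiguration)
val in = fs.open(new Path(path))
val doc = try XML.load(in) finally in.close()   // parse the stream instead of a local file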

Thanks.

CodePudding user response:

Found a way to set Spark to be case sensitive, which now lets the XML be read successfully:

spark.conf.set("spark.sql.caseSensitive", "true")
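
With that set before the load, the same spark-xml read sketched in the question goes through, e.g. (rowTag and path are still placeholders):

// Run after setting spark.sql.caseSensitive, so d:Id and d:ID stay separate columns.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "m:properties")
  .load("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")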