XPath operation on an XML file

I am trying to load an XML file as a string and then run some XPath operations on it.

The following works:

df=spark.createDataFrame([['<?xml version="1.0" encoding="UTF-8"?>\
<note>\
  <to>Tove</to>\
  <from>Jani</from>\
  <heading>Reminder</heading>\
  <body>Don\'t forget me this weekend!</body>\
</note>']],['value'])

df.printSchema()
df=df.selectExpr("xpath(value,'note/to/text()')")

Now I am trying to put the XML in a file, load it as text, and then run a similar operation on it:

xml_file="\\path to the file,contents are exactly same as above example"
df=spark.read.option("wholetext", True).text(xml_file)
df=df.selectExpr('xpath(value,"note/to/text()")')
df.show()

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 33) (10.191.197.4 executor 0): java.lang.RuntimeException: Error loading expression 'note/to/text()'

Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 39; Premature end of file

Can somebody please help? The exact same operation fails when I read the XML from a file. I do NOT want to read the file as XML: due to project requirements, I have to load the entire XML as a string and then use XPath operations to extract specific tags.

Please suggest

User response:

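The most probable reason for the premature-end error is that the XML spans multiple lines. When the file is read as text, it is broken into multiple rows, one per line, so Spark cannot identify where a tag starts and where it ends. This is consistent with the reported position: the parser ran out of input at line 1, column 39, which is immediately after the 38-character XML declaration on the first line. Try putting the XML in the file on a single line before using xpath.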
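
To illustrate the explanation above, here is a minimal sketch that uses only Python's standard library, not Spark. It assumes the file contents were split into one row per line, and it uses a hypothetical helper, try_parse, that is not part of any Spark API. It shows that no single line is a complete document containing note/to, while the whole text is.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, but with real line breaks,
# as it would appear in the file on disk.
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""


def try_parse(fragment):
    """Hypothetical helper: return the text of note/to, or None.

    None is returned when the fragment is not well-formed XML on its own,
    or when it parses but has no <to> child under its root element.
    """
    try:
        root = ET.fromstring(fragment)
    except ET.ParseError:
        return None
    return root.findtext("to")


# Whole document in one string: the path resolves.
whole_result = try_parse(xml_text)

# One "row" per line, which is the situation the answer describes:
# every row fails on its own.
per_line_results = [try_parse(line) for line in xml_text.splitlines()]

print(whole_result)
print(per_line_results)
```

If the whole file really does arrive as a single row, the xpath call should behave as in the working in-memory example. One thing that may be worth trying, offered as an assumption rather than a confirmed diagnosis, is passing the flag as a keyword argument, spark.read.text(xml_file, wholetext=True), since wholetext is a documented parameter of the PySpark text() reader.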