I am trying to load an XML file as a string and then run XPath operations on it.
The following works:
df=spark.createDataFrame([['<?xml version="1.0" encoding="UTF-8"?>\
<note>\
<to>Tove</to>\
<from>Jani</from>\
<heading>Reminder</heading>\
<body>Don\'t forget me this weekend!</body>\
</note>']],['value'])
df.printSchema()
df=df.selectExpr("xpath(value,'note/to/text()')")
Now I am trying to put the same XML in a file, load it as text, and run the same operation on it:
xml_file="\\path to the file,contents are exactly same as above example"
df=spark.read.option("wholetext", True).text(xml_file)
df=df.selectExpr('xpath(value,"note/to/text()")')
df.show()
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 33) (10.191.197.4 executor 0): java.lang.RuntimeException: Error loading expression 'note/to/text()'
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 39; Premature end of file
Can somebody please help? The exact same operation fails when reading from a file. I do NOT want to read the file as XML; due to project requirements I have to load the entire XML as a string and then use XPath operations to extract specific tags.
Please suggest.
CodePudding user response:
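The most likely reason for the premature-end error is that the XML spans multiple lines, so when the file is read it is split into multiple rows, and Spark cannot tell where a tag starts and where it ends. The error position is consistent with this: column 39 of line 1 is just past the end of the 38-character XML declaration, which suggests the parser received only the first line. Try putting the XML on a single line in the file before using XPath.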
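A minimal sketch of that suggestion, assuming the file is small enough to rewrite on the driver and that no element's text content contains line breaks that must be preserved. The helper names `flatten_xml_file` and `extract_to`, and the temporary file names, are hypothetical and only for illustration; `extract_to` expects an existing SparkSession and is not called here. The local check at the end uses Python's standard XML parser rather than Spark, just to confirm the flattened text is a complete document.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def flatten_xml_file(src_path, dst_path):
    """Copy src_path to dst_path with every line break removed (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src:
        # Drop only the trailing newline characters of each line, then join.
        single_line = "".join(line.rstrip("\r\n") for line in src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(single_line)
    return single_line


def extract_to(spark, path):
    """Run the question's XPath on a one-line file (hypothetical helper).

    Assumes `spark` is an existing SparkSession. spark.read.text yields one
    row per line, so a single-line file gives one row holding the whole document.
    """
    df = spark.read.text(path)
    return df.selectExpr("xpath(value, 'note/to/text()') AS to_values")


# Local check without Spark: write a multi-line sample, flatten it, parse it.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<note>\n"
    "<to>Tove</to>\n"
    "<from>Jani</from>\n"
    "</note>\n"
)
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "note.xml")
dst_path = os.path.join(workdir, "note_single_line.xml")
with open(src_path, "w", encoding="utf-8") as f:
    f.write(sample)

flat = flatten_xml_file(src_path, dst_path)
# Parse from bytes so the UTF-8 declaration in the prolog is honored.
to_text = ET.fromstring(flat.encode("utf-8")).findtext("to")
print(to_text)
```

After flattening, something like `extract_to(spark, dst_path).show()` would be the Spark-side equivalent of the working in-memory example from the question.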