Parsing an XML file using nokogiri to create \index fields for LaTeX

Time:09-06

I'm a professional indexer, new to Ruby and Nokogiri, and I need some assistance.

I'm working on a set of macros that will allow me to take an XML file, output from my indexing software, and parse it into valid \index{} commands for inclusion in a LaTeX source file. Each XML <record> contains at least two <field> tags, so I will have to iterate over the multiple <field> tags to build my \index{} entry.

The following is an example of an index record from the XML file:

<record time="2022-08-27T17:25:12" id="30">
    <field><text style="i"/><hide>SS </hide>Titanic<text/></field>
    <field>passengers</field>
    <field ><text style="b"/>5<text/></field>
</record>

I will produce intermediate output of this record in the form of:

\index{Titanic@\textit{SS Titanic}!passengers|textbf} 5

(The numeric locator is used to place the \index{} entry at the correct spot in the LaTeX file and won't be included in the LaTeX source file.)
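
As a rough illustration of the target format described above, the following Ruby sketch assembles the intermediate \index{} line from pieces that have already been pulled out of a record. The helper names (format_heading, index_entry), the hash keys, and the mapping of the style codes "i" and "b" to \textit and \textbf are assumptions made for this example; they are inferred from the sample record and are not something the indexing software defines.

```ruby
# Assumed mapping from the XML style codes to LaTeX commands.
STYLE_COMMANDS = { "i" => "textit", "b" => "textbf" }.freeze

# Format one heading. When the displayed form differs from the sort key,
# emit "sortkey@displayed", which is the makeindex convention.
def format_heading(sort_key:, display:, style: nil)
  command = STYLE_COMMANDS[style]
  shown = command ? "\\#{command}{#{display}}" : display
  shown == sort_key ? sort_key : "#{sort_key}@#{shown}"
end

# Join the headings with "!" and append the locator style as "|command".
# The locator itself is placed after the closing brace, as in the
# intermediate output shown above.
def index_entry(headings, locator:, locator_style: nil)
  body = headings.map { |h| format_heading(**h) }.join("!")
  command = STYLE_COMMANDS[locator_style]
  body += "|#{command}" if command
  "\\index{#{body}} #{locator}"
end

headings = [
  { sort_key: "Titanic", display: "SS Titanic", style: "i" },
  { sort_key: "passengers", display: "passengers" }
]
puts index_entry(headings, locator: "5", locator_style: "b")
# prints: \index{Titanic@\textit{SS Titanic}!passengers|textbf} 5
```

Keeping the string assembly separate from the XML parsing means the formatting rules can be checked without a parser in the loop.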

I am using Nokogiri to manipulate the XML file and have reached the point where I get a node list containing just the <field> tags for each <record>. However, I need to retrieve everything inside each <field>, including the formatting information. If I call the text method on a <field>, it returns "SS Titanic", for example, with all formatting information stripped away.

I'm stuck on how to access the entire text string in the <field> tag. Once I can get that, I have a good idea of how to structure my parser.

Any help will be greatly appreciated.

Answer:

Does this help?

require "nokogiri"

# A heredoc avoids having to escape the double quotes inside the XML.
xml = <<~XML
  <record time="2022-08-27T17:25:12" id="30">
      <field><text style="i"/><hide>SS </hide>Titanic<text/></field>
      <field>passengers</field>
      <field ><text style="b"/>5<text/></field>
  </record>
XML

# Select every <field> element in the document.
fields = Nokogiri::XML(xml).xpath(".//field")

puts fields.first.text   # prints SS Titanic
p fields.map(&:text)     # prints ["SS Titanic", "passengers", "5"]