How to search for nodes at the uppermost levels only in Ruby-Nokogiri?

Time:08-24

HTML (particularly MathML) can be heavily nested. With Ruby-Nokogiri, I want to search for a node at the uppermost level at which it occurs within a parent node; that level is arbitrary. Here is an example HTML/MathML structure:

  1. <math><semantics>… (arbitrary depth)
    1. <mrow> (call it (1))
      1. <mrow> (1-1)
    2. <mrow> (2)
      1. <mrow> (2-1)
      2. <mrow> (2-2)

Given a Nokogiri::HTML document for it, page, the call page.css("math mrow") returns all the <mrow> nodes as an array-like Nokogiri::XML::NodeSet (5 nodes in this case), with the last node being "<mrow> (2-2)".

My goal is to identify the last <mrow> node at the uppermost level, i.e., "<mrow> (2)" in the example above (so that I can add another node after it).

In other words, I want to get the "last node of a certain kind at the shallowest depth among all the nodes of the kind". The depth of the uppermost level for the type of node is unknown and so I cannot limit the depth for the search.

CodePudding user response:

This looks like a breadth-first search (BFS) problem at first glance: put the root node in a queue; take the node at the front of the queue and, if it is not of the desired type, append its children to the back of the queue in reverse order; repeat until you find the desired node. It will be the shallowest one because BFS visits nodes level by level, and the last one because the children are added in reverse.

Quick and dirty example:

require "nokogiri"

def find_last_shallowest(root)
  queue = [root]

  while queue.any?
    element = queue.shift
    break element if matching?(element)
    queue += element.children.reverse
  end
end

def matching?(element)
  # Put your matching logic here
  element.name == "m"
end

doc = <<~XML
<foo>
  <bar>
    <m>
      <x></x>
    </m>
  </bar>
  <baz>
  </baz>
  <m>
    <y></y>
  </m>
  <m>
    <z></z>
  </m>
</foo>
XML

xml = Nokogiri::XML(doc)

find_last_shallowest(xml.root) # => #(Element:0xf744 { name = "m", children = [ #(Text "\n    "), #(Element:0xf758 { name = "z" }), #(Text "\n  ")] })

It finds the <m> element that has <z> as a child, which is the last one at the shallowest depth.

CodePudding user response:

If you want the uppermost mrow node in terms of depth, you could select, among all mrow:first-of-type matches, the one with the fewest ancestors:

first_mrow = page.css('mrow:first-of-type').min_by.with_index { |node, index| [node.ancestors.size, index] }

Adding with_index ensures that, among nodes with an identical number of ancestors, the first one will be picked.

To get the first mrow node from the start of the document (regardless of depth), you could simply use:

first_mrow = page.at_css('mrow')

With the first mrow node you can then select its parent node:

parent = first_mrow.parent

and finally retrieve the last element from the parent's (immediate) mrow nodes:

last_mrow = parent.css('> mrow').last

The latter can also be expressed via the :last-of-type CSS pseudo-class:

last_mrow = parent.at_css('> mrow:last-of-type')