Nokogiri misses HTML inner text if it contains "<"

Time:12-22

I am writing a rake task that converts an HTML string to JSON, using Nokogiri to parse the HTML string and build the JSON. Everything was going fine until I noticed a problem when the inner text looks like

< 109 

or

> 109 

then Nokogiri returns " 109" instead of "> 109" or "< 109".

If I have a string like

str = '<td>< 109</td>'

then

result = Nokogiri::XML(str)

will return

#(Document:0x115f8 {
  name = "document",
  children = [ #(Element:0x1160c { name = "td", children = [ #(Text " 109")] })]
  })

and

result.children.children.to_s 

will return " 109", but I need "< 109".

How can I get the desired result?

I am expecting to get "< 109" instead of just " 109".

CodePudding user response:

You could replace Nokogiri::XML with Nokogiri::HTML, which is more permissive with invalid syntax:

Nokogiri::XML('<td>< 109</td>').children.last.text  # => " 109"
Nokogiri::HTML('<td>< 109</td>').children.last.text # => "< 109"

CodePudding user response:

This is broken HTML. If this is the only issue you are trying to solve, you can fix the HTML before parsing it by replacing each stray < with &lt;.

str = '<td>< 109</td>'

fixed_str = str.gsub(/>< ([0-9]+)</, '>&lt; \1<')
=> "<td>&lt; 109</td>"

result = Nokogiri::XML(fixed_str)
=> #(Document:0x2ac1be2860cc { name = "document", children = [ #(Element:0x2ac1be282940 { name = "td", children = [ #(Text "< 109")] })] })

If there are > characters too:

fixed_str = str.gsub(/>< ([0-9]+)</, '>&lt; \1<').gsub(/>> ([0-9]+)</, '>&gt; \1<')
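
The substitutions above only cover a number that directly follows the stray bracket. If the input may contain other text after "<" or ">", a more general pre-processing pass could look like the sketch below. It uses only the Ruby standard library. The helper name escape_stray_angle_brackets and the constant TAG_OR_BRACKET are hypothetical, not part of Nokogiri. It assumes that anything shaped like <name ...> or </name> is a real tag and that attribute values never contain angle brackets. Comments, doctypes and CDATA sections are not handled, so it is not a substitute for a real HTML parser.

```ruby
# Matches either something that looks like a tag (group 1)
# or a single stray angle bracket (group 2).
# Assumption: tags start with a letter after "<" or "</" and
# contain no further angle brackets before the closing ">".
TAG_OR_BRACKET = %r{(</?[A-Za-z][^<>]*>)|([<>])}

# Hypothetical helper: keeps tag-like spans untouched and
# escapes every other "<" or ">" as an HTML entity.
def escape_stray_angle_brackets(html)
  html.gsub(TAG_OR_BRACKET) do
    match = Regexp.last_match
    if match[1]
      match[1]                              # tag-like span: keep as is
    else
      match[2] == '<' ? '&lt;' : '&gt;'     # stray bracket: escape it
    end
  end
end

puts escape_stray_angle_brackets('<td>< 109</td>')  # prints <td>&lt; 109</td>
puts escape_stray_angle_brackets('<td>> 109</td>')  # prints <td>&gt; 109</td>
```

The escaped string can then be passed to Nokogiri::XML in place of the original, as with fixed_str above.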