I am writing a rake task that converts an HTML string to JSON. I use Nokogiri to parse the HTML and build the JSON from the result. Everything worked until I noticed that when the inner text looks like
< 109
or
> 109
then Nokogiri returns "109" instead of "> 109" or "< 109".
For example, if I have a string like
str = '<td>< 109</td>'
then
result = Nokogiri::XML(str)
will return
#(Document:0x115f8 {
name = "document",
children = [ #(Element:0x1160c { name = "td", children = [ #(Text " 109")] })]
})
and
result.children.children.to_s
will return " 109", but I need "< 109".
How can I get the desired result? I expect "< 109" rather than just " 109".
CodePudding user response:
You could replace Nokogiri::XML with Nokogiri::HTML, which is more permissive with invalid syntax:
Nokogiri::XML('<td>< 109</td>').children.last.text # => " 109"
Nokogiri::HTML('<td>< 109</td>').children.last.text # => "< 109"
CodePudding user response:
This is broken HTML: a literal < in text content has to be written as the entity &lt;. If this is the only issue you are trying to solve, you can fix the HTML before parsing it by replacing each bare < with &lt;.
str = '<td>< 109</td>'
fixed_str = str.gsub(/>< ([0-9]+)</, '>&lt; \1<')
=> "<td>&lt; 109</td>"
result = Nokogiri::XML(fixed_str)
=> #(Document:0x2ac1be2860cc { name = "document", children = [ #(Element:0x2ac1be282940 { name = "td", children = [ #(Text "< 109")] })] })
If there are > characters too, escape them the same way with &gt;:
fixed_str = str.gsub(/>< ([0-9]+)</, '>&lt; \1<').gsub(/>> ([0-9]+)</, '>&gt; \1<')
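If the stray characters are not always followed by a space and digits, the same pre-escaping idea can be generalized. The following is only a rough sketch, not part of Nokogiri: the names TAG_PATTERN, WHOLE_TAG and escape_stray_angle_brackets are made up for illustration. It assumes that anything shaped like <name ...> or </name> is a real tag and that attribute values never contain < or >, so it will misfire on input such as "a <b" where a bare < is directly followed by a letter.

```ruby
# Hypothetical helper: escape < and > that appear in text content,
# leaving anything that looks like a tag untouched.
# Assumption: a tag is "<" or "</" followed by a letter, up to the next ">".
TAG_PATTERN = %r{</?[A-Za-z][^<>]*>}
WHOLE_TAG   = /\A#{TAG_PATTERN}\z/

def escape_stray_angle_brackets(markup)
  # Splitting on a pattern with a capture group keeps the tags in the result,
  # so the array contains both text chunks and tag chunks.
  markup.split(/(#{TAG_PATTERN})/).map do |part|
    if part.match?(WHOLE_TAG)
      part                                      # a tag: keep as is
    else
      part.gsub('<', '&lt;').gsub('>', '&gt;')  # text: escape stray brackets
    end
  end.join
end

puts escape_stray_angle_brackets('<td>< 109</td>')
puts escape_stray_angle_brackets('<td>> 109</td>')
```

The returned string can then be passed to Nokogiri::XML in place of the original one. Existing entities such as &lt; are left alone because only the literal < and > characters are rewritten.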