I want to use a Unicode character with Nokogiri::XML::DocumentFragment.
frag = Nokogiri::XML::DocumentFragment.parse("<foo>ü</foo>")
=> <foo>&#xFC;</foo>
The Unicode character is escaped. I need to pass encoding: 'UTF-8' when serializing to get a readable character.
frag.to_html(encoding: 'UTF-8')
=> "<foo>ü</foo>"
Is there an option to set the encoding when parsing the string? Nokogiri::HTML::DocumentFragment.parse treats the string as I expected, but I need to use XML.
frag = Nokogiri::HTML::DocumentFragment.parse("<foo>ü</foo>")
=> <foo>ü</foo>
CodePudding user response:
According to the Nokogiri documentation, the text is already stored internally as UTF-8:
"Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document."
So if you call, for example, #text on your frag instead of printing the entire frag object, you'll see the ü printed correctly:
puts frag.text
# => ü
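Otherwise, you can use Nokogiri.XML instead of Nokogiri::XML::DocumentFragment and pass the encoding explicitly. Note that this parses a full document rather than a fragment, so the output includes the XML declaration.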
doc = Nokogiri.XML('<foo>ü</foo>', nil, 'UTF-8')
puts doc
# => <?xml version="1.0" encoding="UTF-8"?>
# => <foo>ü</foo>