How to use Unicode chars with Nokogiri::XML::DocumentFragment

Time:05-14

I want to use Unicode characters with Nokogiri::XML::DocumentFragment.

frag = Nokogiri::XML::DocumentFragment.parse("<foo>ü</foo>")
=> <foo>&#xFC;</foo>

The Unicode character is escaped. To get a readable character, I have to pass encoding: 'UTF-8' when serializing.

frag.to_html(encoding: 'UTF-8')
=> "<foo>ü</foo>"

Is there an option to set the encoding when parsing the string?

Nokogiri::HTML::DocumentFragment.parse treats the string as I expected, but I need to use XML.

frag = Nokogiri::HTML::DocumentFragment.parse("<foo>ü</foo>")
=> <foo>ü</foo>

CodePudding user response:

According to the Nokogiri documentation, the text is already stored internally as UTF-8:

Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.

So if you call, for example, #text on your frag instead of printing the entire frag object, you'll see the ü printed correctly:

puts frag.text
# => ü
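
To make the point about escaping concrete: &#xFC; is a numeric character reference for the same code point (U+00FC) as ü, so only the serialized form differs, not the stored text. The sketch below uses plain Ruby only; decode_hex_refs is a hypothetical helper written for illustration, not part of Nokogiri, and it handles hexadecimal references only.

```ruby
# Hypothetical helper (not a Nokogiri API): turns hexadecimal numeric
# character references such as "&#xFC;" back into the characters they name.
def decode_hex_refs(str)
  str.gsub(/&#x([0-9A-Fa-f]+);/) { [$1.hex].pack("U") }
end

escaped = "<foo>&#xFC;</foo>"
decoded = decode_hex_refs(escaped)

puts decoded                        # the fragment with a literal ü
puts decoded == "<foo>ü</foo>"      # same text, different spelling
puts decoded.encoding               # encoding of the resulting Ruby string
p "ü".bytes                         # the UTF-8 bytes behind the character
```

Both spellings are valid XML for the same character, which is why #text returns ü even though the inspected fragment shows the escape.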

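Otherwise, you can parse with Nokogiri.XML instead of Nokogiri::XML::DocumentFragment and pass the encoding explicitly. Note that this returns a full Nokogiri::XML::Document, so the output includes the XML declaration.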

doc = Nokogiri.XML('<foo>ü</foo>', nil, 'UTF-8')
puts doc

# => <?xml version="1.0" encoding="UTF-8"?>
# => <foo>ü</foo>   