How to use Nokogiri to get the full HTML without any text content

Time:10-27

I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.

I tried this:

require 'nokogiri'
x = "<html>  <body>  <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s

This outputs:

<div class="example"></div>

I've also tried running it without the children.remove part:

y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s

But then I get:

<div class="example"><span>Hello</span></div>

But what I actually want is:

<html><body><div class='example'><span></span></div></body></html>

CodePudding user response:

I'm not convinced this is the best way to solve the underlying problem, but the immediate issue is that you are capturing the return value of the iterator, which is the set of matched nodes rather than the document. If you remove the text nodes from the parsed document and then print the document itself, you get the full markup:

require 'nokogiri'
html = "<html>  <body>  <div class='example'><span>Hello</span></div></body></html>"

# Parse HTML
doc = Nokogiri::HTML.parse(html)

puts doc.inner_html
# => "<html>  <body>  <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"

# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }

puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"
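
As a side note on why the original attempt printed only a fragment: `each` returns the collection it iterated over, so `y` held the matched nodes, not the document. The sketch below illustrates this with a plain Array standing in for the node set. It assumes Nokogiri's node set follows the usual Ruby convention for `each`, and `each_return_value` is a hypothetical helper, not part of Nokogiri.

```ruby
# Illustration only: a plain Array stands in for the matched node set.
# `each_return_value` is a hypothetical helper, not a Nokogiri method.
def each_return_value(items)
  # The block's results are discarded; `each` hands back its receiver.
  items.each { |item| item.upcase }
end

matched = %w[div span]
result  = each_return_value(matched)

puts result.equal?(matched) # true: the very same object comes back
puts result.inspect         # ["div", "span"]: the block's results were discarded
```

This is why the answer above mutates `doc` in place and then prints `doc`, instead of printing whatever the iterator returned.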

CodePudding user response:

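If you would rather work on the HTML string itself, you can extract the document's text with Nokogiri and gsub it with nothing (= remove it), anchoring the match with word boundaries. Note that this relies on the text appearing as one contiguous run in the markup: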

require 'nokogiri'

html = "<html>  <body>  <div class='example'><span>Hello</span></div></body></html>"
doc  = Nokogiri::HTML(html)
text = doc.text

# Escape the text so any regex metacharacters in it are matched literally
puts html.gsub(/\b#{Regexp.escape(text)}\b/, '')
#=> "<html>  <body>  <div class='example'><span></span></div></body></html>"
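
A caution on the string-based approach: because it matches the text against the raw markup, it can misfire when the text coincides with a tag name or when the document contains more than one text node. The sketch below is pure Ruby and does not call Nokogiri. `strip_text` is a hypothetical helper that applies the same gsub with the interpolated text escaped, and the `text` arguments are hard-coded stand-ins for what `doc.text` is assumed to return.

```ruby
# Hypothetical helper applying the gsub approach to a raw HTML string.
def strip_text(html, text)
  html.gsub(/\b#{Regexp.escape(text)}\b/, '')
end

# Single text node, as in the question: the text is removed cleanly.
puts strip_text("<div><span>Hello</span></div>", "Hello")

# Text that equals a tag name: the tag names are removed as well.
puts strip_text("<span>span</span>", "span")

# Two text nodes: doc.text is assumed to join them as "HelloWorld",
# which never appears in the markup, so nothing is removed.
puts strip_text("<p>Hello</p><p>World</p>", "HelloWorld")
```

For anything beyond a single, distinctive run of text, removing the text nodes from the parsed document is likely the safer route.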