I'm trying to extract an excerpt for an article (Markdown parsed to HTML) that contains only the plain text from paragraphs. All HTML needs to be stripped, and line breaks, tabs and sequential whitespace need to be replaced by a single space.
My first step was creating a simple test:
describe "#from_html" do
  it "creates an excerpt from given HTML" do
    html = "<p>The spice extends <b>life</b>.<br>The spice expands consciousness.</p>\n" \
           "<ul><li>Skip me</li></ul>\n" \
           "<p>The <i>spice</i> is vital to space travel.</p>"
    text = "The spice extends life. The spice expands consciousness. The spice is vital to space travel."
    expect(R::ExcerptHelper.from_html(html)).to eq(text)
  end
end
Then I started fiddling and came up with this:
def from_html(html)
  Nokogiri::HTML.parse(html).css("p").map { |node|
    # Turn <br> into a space and reduce every other child to its text content
    node.children.map { |child|
      child.name == "br" ? " " : child.text
    } << " "
  }.join.strip.gsub(/\s+/, " ")
end
I'm a bit rusty on Rails, and this can probably be done much more efficiently and elegantly. I'm hoping for some pointers here.
Thanks in advance!
Approach 2
I turned to the sanitize method (thanks @max) and wrote a custom scrubber based on Rails::Html::PermitScrubber.
Approach 3
Realizing my source document is formatted as Markdown, I ventured forth by exploring a custom Redcarpet renderer.
See my answer for a complete example.
Answer:
I ended up writing a custom Redcarpet renderer (inspired by Redcarpet::Render::StripDown), which seems the cleanest approach, with the least parsing and converting between formats.
module R::Markdown
  class ExcerptRenderer < Redcarpet::Render::Base
    # Methods where the first argument is the text content
    [
      # block-level calls
      :paragraph,
      # span-level calls
      :codespan, :double_emphasis,
      :emphasis, :underline, :raw_html,
      :triple_emphasis, :strikethrough,
      :superscript, :highlight, :quote,
      # footnotes
      :footnotes, :footnote_def, :footnote_ref,
      # low level rendering
      :entity, :normal_text
    ].each do |method|
      define_method method do |*args|
        args.first
      end
    end

    # Methods where content is replaced with an empty space
    [
      :autolink, :block_html
    ].each do |method|
      define_method method do |*|
        " "
      end
    end

    # Methods we are going to [snip]
    [
      :list, :image, :table, :block_code
    ].each do |method|
      define_method method do |*|
        " [#{method}] "
      end
    end

    # Other methods
    def link(link, title, content)
      content
    end

    def header(text, header_level)
      " #{text} "
    end

    def block_quote(quote)
      " “#{quote}” "
    end

    # Replace all whitespace with single space
    def postprocess(document)
      document.gsub(/\s+/, " ").strip
    end
  end
end
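The renderer generates most of its callbacks with define_method instead of spelling each one out. A minimal sketch of that pattern on its own, without Redcarpet, might look like the following. The class name PassThrough and the short callback lists are assumptions made for illustration only.

```ruby
# Hypothetical stand-in for a renderer; it does not depend on Redcarpet.
class PassThrough
  # Callbacks that return their first argument (the text content) unchanged.
  %i[paragraph emphasis].each do |name|
    define_method(name) { |*args| args.first }
  end

  # Callbacks that replace their content with a labelled placeholder.
  %i[list table].each do |name|
    define_method(name) { |*| " [#{name}] " }
  end
end

renderer = PassThrough.new
puts renderer.paragraph("The spice extends life.")  # prints: The spice extends life.
puts renderer.emphasis("spice", :ignored)           # prints: spice
puts renderer.list("- Skip me").inspect             # prints: " [list] "
```

The splat in the block parameters lets each generated method accept however many arguments the caller passes, which is why one definition can cover callbacks with different signatures.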
The renderer is then used like this:
extensions = {
  autolink: true,
  disable_indented_code_blocks: true,
  fenced_code_blocks: true,
  lax_spacing: true,
  no_intra_emphasis: true,
  strikethrough: true,
  superscript: true,
  tables: true
}
markdown = Redcarpet::Markdown.new(R::Markdown::ExcerptRenderer, extensions)
markdown.render(md).html_safe
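The whitespace cleanup done in postprocess can be tried on its own in plain Ruby. This is only a small sketch, and the helper name collapse_whitespace is made up for the example.

```ruby
# Hypothetical helper: collapse runs of whitespace (spaces, tabs, line breaks)
# into a single space and trim both ends.
def collapse_whitespace(text)
  text.gsub(/\s+/, " ").strip
end

raw = "  The spice\textends life.\n\nThe spice   expands consciousness. "
puts collapse_whitespace(raw)
# prints: The spice extends life. The spice expands consciousness.
```

In a Rails app, ActiveSupport's String#squish does much the same thing, so the helper is mainly useful outside Rails.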