Create text excerpt from HTML paragraphs in Rails

Time:03-11

I'm trying to extract an excerpt for an article (Markdown parsed to HTML) that includes only the plain text from paragraphs. All HTML needs to be stripped, and line breaks, tabs and sequential whitespace need to be replaced by a single space.

My first step was creating a simple test:

describe "#from_html" do
  it "creates an excerpt from given HTML" do
    html = "<p>The spice extends <b>life</b>.<br>The spice    expands consciousness.</p>\n
           <ul><li>Skip me</li></ul>\n
           <p>The <i>spice</i> is vital to space travel.</p>"

    text = "The spice extends life. The spice expands consciousness. The spice is vital to space travel."

    expect(R::ExcerptHelper.from_html(html)).to eq(text)
  end
end

Then I started fiddling and came up with this:

def from_html(html)
  Nokogiri::HTML.parse(html).css("p").map{|node|
    node.children.map{|child|
      child.name == "br" ? child.replace(" ") : child
    } << " "
  }.join.strip.gsub(/\s+/, " ")
end

I'm a bit rusty on Rails, and this can probably be done much more efficiently and elegantly. I'm hoping for some pointers here.

Thanks in advance!


Approach 2

I turned to the sanitize method (thanks @max) and wrote a custom scrubber based on Rails::Html::PermitScrubber.


Approach 3

Realizing my source document is formatted as Markdown, I ventured forth by exploring a custom Redcarpet renderer.

See my answer for a complete example.

CodePudding user response:

I ended up writing a custom Redcarpet renderer (inspired by Redcarpet::Render::StripDown), which seems the cleanest approach, with the least parsing and converting between formats.

module R::Markdown
  class ExcerptRenderer < Redcarpet::Render::Base
    # Methods where the first argument is the text content
    [
      # block-level calls
      :paragraph,

      # span-level calls
      :codespan, :double_emphasis,
      :emphasis, :underline, :raw_html,
      :triple_emphasis, :strikethrough,
      :superscript, :highlight, :quote,

      # footnotes
      :footnotes, :footnote_def, :footnote_ref,

      # low level rendering
      :entity, :normal_text
    ].each do |method|
      define_method method do |*args|
        args.first
      end
    end

    # Methods where content is replaced with a single space
    [
      :autolink, :block_html
    ].each do |method|
      define_method method do |*|
        " "
      end
    end

    # Methods we are going to [snip]
    [
      :list, :image, :table, :block_code
    ].each do |method|
      define_method method do |*|
        " [#{method}] "
      end
    end

    # Other methods
    def link(link, title, content)
      content
    end

    def header(text, header_level)
      " #{text} "
    end

    def block_quote(quote)
      " “#{quote}” "
    end

    # Collapse every run of whitespace into a single space
    def postprocess(document)
      document.gsub(/\s+/, " ").strip
    end
  end
end

Then render the Markdown source (the md variable below) with it:

extensions = {
  autolink:                     true,
  disable_indented_code_blocks: true,
  fenced_code_blocks:           true,
  lax_spacing:                  true,
  no_intra_emphasis:            true,
  strikethrough:                true,
  superscript:                  true,
  tables:                       true
}

markdown = Redcarpet::Markdown.new(R::Markdown::ExcerptRenderer, extensions)

markdown.render(md).html_safe
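
The whitespace handling in postprocess can be tried on its own, without Redcarpet. The helper name collapse_whitespace below is hypothetical and the snippet is only a sketch in plain Ruby; in a Rails app, ActiveSupport's String#squish should give a similar result.

```ruby
# Hypothetical helper, not part of Redcarpet or Rails: collapses every run
# of spaces, tabs and line breaks into one space and trims both ends.
def collapse_whitespace(text)
  text.gsub(/\s+/, " ").strip
end

puts collapse_whitespace("The spice\textends  life.\n\nThe spice   expands consciousness. ")
# prints: The spice extends life. The spice expands consciousness.
```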