How to linkify internal text document cross-references?

Time:05-02

I'm making a web version of a collection of docs with plaintext like this:

...as found in article 6, depending on...

I'm writing code to add relative URL anchors (linkify):

...as found in <a href="article_6">article 6</a>, depending on...

I'm open to any programming language, and currently have Ruby regex code that handles this simple case:

    with_single_article_links = html.gsub(/(article \d+)/i) do
      last_match = Regexp.last_match(1)
      "<a href=\"#{last_match.gsub(' ', '_')}\">#{last_match}</a>"
    end

But I'm looking for ideas on handling more complex cases like these, with multiple citations:

  • ...as found in article 6 or 7, depending on...
  • ...as found in article 6, 7 or 8, depending on...
  • ...as found in article 6, 7 or 8 bis, depending on...

If I keep going with my current code, I'd probably have two levels of regexes: a first match for article \d+, and then a second check for one of these complex cases.

But is there some other approach I could take? I'm open to any programming language and technique. This is basically a reality check for me that I'm using a decent method.

Update: Expanding the regex, this is working so far:

article (\d+)((, \d+)* or (\d+))?

Live view: https://regex101.com/r/WHtM5C/1

The second group will just need some simple parsing of the comma-separated list.
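A minimal sketch of that two-step idea, assuming the expanded regex above is extended with an optional "bis" suffix. The names NUMBER, CITATION, link_for and linkify_articles are hypothetical, and the article_8_bis href format is an assumption rather than something the docs require:

```ruby
# Assumption: an article number is digits, optionally followed by " bis".
NUMBER   = /\d+(?: bis\b)?/
# "article N", optionally followed by ", N, N or N" as in the regex above.
CITATION = /(article) (#{NUMBER})((?:, #{NUMBER})* or #{NUMBER})?/i

# Hypothetical helper: builds one anchor for an article number.
def link_for(number, label)
  "<a href=\"article_#{number.tr(' ', '_')}\">#{label}</a>"
end

def linkify_articles(text)
  text.gsub(CITATION) do
    word, first, rest = Regexp.last_match.captures
    head = link_for(first, "#{word} #{first}")
    # Second pass: link each remaining number; ", " and " or " stay as they are.
    tail = rest.to_s.gsub(NUMBER) { |n| link_for(n, n) }
    head + tail
  end
end

puts linkify_articles("as found in article 6, 7 or 8 bis, depending on")
```

The outer gsub finds the whole citation and the inner gsub only ever sees the list part, so a bare number elsewhere in the text is never linked.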

CodePudding user response:

I know this is going to seem like total overkill and really verbose, but the first thing that comes to mind is to use a builder pattern: split your input into tokens, then convert each token based on where you are in the stream.

input = "as found in article 6 or 7, depending on\nas found in article 6, 7 or 8, depending on\nas found in article 6, 7 or 8 bis, depending on"

class TextReader
  attr_reader :builder, :text

  def initialize(text, builder)
    @text = text
    @builder = builder
  end

  def parse()
    stream = text.split(/(?=\s|,)/)
    stream.each do |token|
      case token
      when /^\s+$/
        builder.convert_space(token)
      when /^\s*,$/, /^\s+or$/
        builder.convert_joiner(token)
      when /^\s*\d+$/
        builder.convert_number(token)
      when /^\s*as$/
        builder.convert_as(token)
      when /^\s*found$/
        builder.convert_found(token)
      when /^\s*in$/
        builder.convert_in(token)
      when /^\s*article$/
        builder.convert_article(token)
      else
        builder.convert_other(token)
      end
    end
  end
end

class HTMLBuilder
  attr_reader :html

  def initialize()
    @html = ""
  end

  def convert_space(token)
    html << token
  end

  def convert_joiner(token)
    @joiner = true
    html << token
  end

  def convert_other(token)
    @as = @found = @in = @article = @joiner = false
    html << token
  end

  def convert_number(token)
    token =~ /^\s*(\d+)/
    if @article
      if @joiner
        html << " <a href=\"article_#{$1}\" #{$1}>"
      else
        html << " <a href=\"article_#{$1}\" article #{$1}>"
      end
    else
      html << token
    end
  end

  def convert_as(token)
    @as = true
    html << token
  end

  def convert_found(token)
    @found = true if @as
    html << token
  end

  def convert_in(token)
    @in = true if @found
    html << token
  end

  def convert_article(token)
    @article = true if @in
  end
end

builder = HTMLBuilder.new
reader = TextReader.new(input, builder)
reader.parse
puts "output:"
puts builder.html


=>
output:
as found in <a href="article_6" article 6> or <a href="article_7" 7>, depending on
as found in <a href="article_6" article 6>, <a href="article_7" 7> or <a href="article_8" 8>, depending on
as found in <a href="article_6" article 6>, <a href="article_7" 7> or <a href="article_8" 8> bis, depending on

CodePudding user response:

I added a second answer because I did not want to make any major changes after the first answer had been upvoted.

As you noticed, this is essentially a state machine: you start "building" a number when you first see the digits, and you complete it when you reach a token that marks the end of the number definition. If number building gets complicated, you can even start a nested builder, i.e. a NumberBuilder, send tokens to it until you reach the end of the number definition, and then ask the builder for the number.

input = "as found in article 6 or 7, depending on\nas found in article 6, 7 bis or 8, depending on\nas found in article 6, 7 or 8 bis, depending on"

class TextReader
  attr_reader :builder, :text

  def initialize(text, builder)
    @text = text
    @builder = builder
  end

  def parse()
    stream = text.split(/(?=\s|,)/)
    stream.each do |token|
      case token
      when /^\s+$/
        builder.convert_space(token)
      when /^\s*,$/, /^\s+or$/
        builder.convert_joiner(token)
      when /^\s*\d+$/
        builder.convert_digits(token)
      when /^\s*as$/
        builder.convert_as(token)
      when /^\s*found$/
        builder.convert_found(token)
      when /^\s*in$/
        builder.convert_in(token)
      when /^\s*article$/
        builder.convert_article(token)
      when /^\s*bis$/
        builder.convert_bis(token)
      else
        builder.convert_other(token)
      end
    end
  end
end

class HTMLBuilder
  attr_reader :html

  def initialize()
    @html = ""
  end

  def convert_space(token)
    html << token
  end

  def convert_joiner(token)
    @joiner = true
    process_number if @number
    html << token
  end

  def convert_other(token)
    process_number if @number
    @as = @found = @in = @article = @joiner = @number = false
    html << token
  end

  def convert_digits(token)
    @number = token
  end

  def convert_bis(token)
    if @number
      @number << token
      process_number
    else
      html << token
    end
  end

  def process_number()
    token = @number
    @number = false
    token =~ /^\s*(\d+)(.*)/
    if @article
      if @joiner
        html << " <a href=\"article_#{$1}#{$2}\" #{$1}#{$2}>"
      else
        html << " <a href=\"article_#{$1}#{$2}\" article #{$1}#{$2}>"
      end
    else
      html << token
    end
  end

  def convert_as(token)
    @as = true
    html << token
  end

  def convert_found(token)
    @found = true if @as
    html << token
  end

  def convert_in(token)
    @in = true if @found
    html << token
  end

  def convert_article(token)
    @article = true if @in
  end
end

builder = HTMLBuilder.new
reader = TextReader.new(input, builder)
reader.parse
puts "output:"
puts builder.html

=>
output:
as found in <a href="article_6" 6> or <a href="article_7" 7>, depending on
as found in <a href="article_6" 6>, <a href="article_7 bis" 7 bis> or <a href="article_8" 8>, depending on
as found in <a href="article_6" 6>, <a href="article_7" 7> or <a href="article_8 bis" 8 bis>, depending on
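
The nested builder idea mentioned above could look roughly like this. It is only a sketch: NumberBuilder, accept, empty? and to_link are hypothetical names, "bis" is the only suffix assumed, and the article_7_bis href format is an assumption:

```ruby
# Hypothetical nested builder: collects the tokens that make up one article
# number ("7" or "7 bis") and renders them as a single anchor when asked.
class NumberBuilder
  SUFFIXES = %w[bis].freeze # assumption: "bis" is the only suffix in use

  def initialize
    @parts = []
  end

  # Returns true if the token belongs to the number being built.
  def accept(token)
    word = token.strip
    fits = case @parts.size
           when 0 then word.match?(/\A\d+\z/)   # a number starts with digits
           when 1 then SUFFIXES.include?(word)  # then at most one suffix
           else false
           end
    @parts << word if fits
    fits
  end

  def empty?
    @parts.empty?
  end

  # Renders the collected number as a link and resets the builder.
  def to_link
    label = @parts.join(' ')
    @parts = []
    "<a href=\"article_#{label.tr(' ', '_')}\">#{label}</a>"
  end
end

number = NumberBuilder.new
[' 7', ' bis', ' or'].each do |token|
  next if number.accept(token)
  # The first token the builder rejects marks the end of the number.
  puts number.to_link unless number.empty?
end
```

The outer builder would hand each token to the NumberBuilder first, and flush it with to_link as soon as a token is rejected.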