I'm making a web version of a collection of docs with plaintext like this:
...as found in article 6, depending on...
I'm writing code to add relative URL anchors (linkify):
...as found in <a href="article_6">article 6</a>, depending on...
I'm open to any programming language, and currently have Ruby regex code that handles this simple case:
with_single_article_links = html.gsub(/article \d+/i) do |match|
  # Build the href from the matched text, e.g. "article 6" -> "article_6".
  "<a href=\"#{match.gsub(' ', '_')}\">#{match}</a>"
end
But I'm looking for ideas on handling more complex cases like these, with multiple citations:
- ...as found in article 6 or 7, depending on...
- ...as found in article 6, 7 or 8, depending on...
- ...as found in article 6, 7 or 8 bis, depending on...
If I keep going with my current code, I'd probably have two levels of regexes: a first match for article \d+, and then a second check for one of these complex cases.
But is there some other approach I could take? I'm open to any programming language and technique. This is basically a reality check for me that I'm using a decent method.
Update: Expanding the regex, this is working so far:
article (\d+)((, \d+)* or (\d+))?
Live view: https://regex101.com/r/WHtM5C/1
The second group will just need some simple parsing of the comma-separated list.
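One way the expanded regex and the "simple parsing" of the second group could fit together is sketched below. The names `NUMBER`, `CITATION`, `linkify` and `link` are made up for illustration, and the optional " bis" suffix is an assumption based on the example sentences above rather than part of the regex in the update.

```ruby
# A number, optionally followed by " bis" (assumed suffix, see lead-in).
NUMBER   = /\d+(?: bis)?/
# Group 1: the first number. Group 2: the optional ", 7 or 8" tail.
CITATION = /article (#{NUMBER})((?:, #{NUMBER})* or #{NUMBER})?/i

# Wraps one number in an anchor; spaces in the href become underscores.
def link(number, label)
  %(<a href="article_#{number.tr(' ', '_')}">#{label}</a>)
end

def linkify(text)
  text.gsub(CITATION) do
    m = Regexp.last_match
    word = m[0][/\A\S+/] # "article" exactly as written in the text
    head = link(m[1], "#{word} #{m[1]}")
    # Second level: link every number in the tail, keep ", " and " or " as-is.
    tail = m[2].to_s.gsub(NUMBER) { |n| link(n, n) }
    head + tail
  end
end

puts linkify("...as found in article 6, 7 or 8 bis, depending on...")
# prints: ...as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8_bis">8 bis</a>, depending on...
```

This keeps the two-level idea (one match for the whole citation, one pass over the list) inside a single `gsub` call.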
CodePudding user response:
I know this is going to seem like total overkill and really verbose, but the first thing that comes to mind is a builder pattern: split your input into tokens, then convert each token based on where you are in the stream.
input = "as found in article 6 or 7, depending on\nas found in article 6, 7 or 8, depending on\nas found in article 6, 7 or 8 bis, depending on"

# Splits the text into tokens and dispatches each one to the builder.
class TextReader
  attr_reader :builder, :text

  def initialize(text, builder)
    @text = text
    @builder = builder
  end

  def parse
    # Split before each whitespace character or comma, so every token
    # keeps its leading delimiter.
    stream = text.split(/(?=\s|,)/)
    stream.each do |token|
      case token
      when /\A\s+\z/
        builder.convert_space(token)
      when /\A\s*,\z/, /\A\s+or\z/
        builder.convert_joiner(token)
      when /\A\s*\d+\z/
        builder.convert_number(token)
      when /\A\s*as\z/
        builder.convert_as(token)
      when /\A\s*found\z/
        builder.convert_found(token)
      when /\A\s*in\z/
        builder.convert_in(token)
      when /\A\s*article\z/
        builder.convert_article(token)
      else
        builder.convert_other(token)
      end
    end
  end
end

# Tracks how far into "as found in article N" we are and emits HTML.
class HTMLBuilder
  attr_reader :html

  def initialize
    @html = +""
  end

  def convert_space(token)
    html << token
  end

  def convert_joiner(token)
    @joiner = true
    html << token
  end

  def convert_other(token)
    # Any unrelated word ends the current citation.
    @as = @found = @in = @article = @joiner = false
    html << token
  end

  def convert_number(token)
    number = token[/\d+/]
    if @article
      # Only the first number in a citation carries the word "article".
      label = @joiner ? number : "article #{number}"
      html << " <a href=\"article_#{number}\">#{label}</a>"
    else
      html << token
    end
  end

  def convert_as(token)
    @as = true
    html << token
  end

  def convert_found(token)
    @found = true if @as
    html << token
  end

  def convert_in(token)
    @in = true if @found
    html << token
  end

  def convert_article(token)
    if @in
      # Hold the word back; it is emitted as part of the first link's text.
      @article = true
    else
      html << token
    end
  end
end

builder = HTMLBuilder.new
reader = TextReader.new(input, builder)
reader.parse
puts "output:"
puts builder.html
=>
output:
as found in <a href="article_6">article 6</a> or <a href="article_7">7</a>, depending on
as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8">8</a>, depending on
as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8">8</a> bis, depending on
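The tokenizing step is the part most worth checking in isolation, since everything else depends on where the splits fall. A minimal sketch (the variable names `sentence` and `tokens` are only for illustration):

```ruby
sentence = "as found in article 6, 7 or 8 bis, depending on"

# The lookahead splits *before* each whitespace character or comma without
# consuming it, so each token keeps its leading space and commas stand alone.
tokens = sentence.split(/(?=\s|,)/)

p tokens
# prints: ["as", " found", " in", " article", " 6", ",", " 7", " or", " 8", " bis", ",", " depending", " on"]
```

Because no characters are discarded, concatenating the tokens reproduces the original sentence, which is why the builder can append untouched tokens directly to its output.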
CodePudding user response:
I added a second answer because I did not want to make major changes to the first one after it had been upvoted.
As you noticed, this is essentially a state machine: you start "building" a number when you first see digits, and you complete it when you reach a token that marks the end of the number. If number building gets complicated, you can even start a nested builder, i.e. a NumberBuilder: send tokens to it until the number ends, then ask it for the result.
input = "as found in article 6 or 7, depending on\nas found in article 6, 7 bis or 8, depending on\nas found in article 6, 7 or 8 bis, depending on"

# Splits the text into tokens and dispatches each one to the builder.
class TextReader
  attr_reader :builder, :text

  def initialize(text, builder)
    @text = text
    @builder = builder
  end

  def parse
    # Split before each whitespace character or comma, so every token
    # keeps its leading delimiter.
    stream = text.split(/(?=\s|,)/)
    stream.each do |token|
      case token
      when /\A\s+\z/
        builder.convert_space(token)
      when /\A\s*,\z/, /\A\s+or\z/
        builder.convert_joiner(token)
      when /\A\s*\d+\z/
        builder.convert_digits(token)
      when /\A\s*as\z/
        builder.convert_as(token)
      when /\A\s*found\z/
        builder.convert_found(token)
      when /\A\s*in\z/
        builder.convert_in(token)
      when /\A\s*article\z/
        builder.convert_article(token)
      when /\A\s*bis\z/
        builder.convert_bis(token)
      else
        builder.convert_other(token)
      end
    end
    # Flush a number that is still pending when the input ends.
    builder.finish
  end
end

# Buffers digits until the number is known to be complete, then emits HTML.
class HTMLBuilder
  attr_reader :html

  def initialize
    @html = +""
  end

  def convert_space(token)
    html << token
  end

  def convert_joiner(token)
    # Emit the pending number before marking the joiner, so the first
    # number in a citation still gets the "article" label.
    process_number if @number
    @joiner = true
    html << token
  end

  def convert_other(token)
    process_number if @number
    @as = @found = @in = @article = @joiner = @number = false
    html << token
  end

  def convert_digits(token)
    @number = token
  end

  def convert_bis(token)
    if @number
      @number << token
      process_number
    else
      html << token
    end
  end

  def finish
    process_number if @number
  end

  def process_number
    token = @number
    @number = false
    token =~ /\A\s*(\d+)(.*)\z/
    number = $1
    suffix = $2 # e.g. " bis", or "" when there is no suffix
    if @article
      label = @joiner ? "#{number}#{suffix}" : "article #{number}#{suffix}"
      # Spaces are not valid in an href, so "7 bis" becomes "7_bis".
      html << " <a href=\"article_#{number}#{suffix.tr(' ', '_')}\">#{label}</a>"
    else
      html << token
    end
  end

  def convert_as(token)
    @as = true
    html << token
  end

  def convert_found(token)
    @found = true if @as
    html << token
  end

  def convert_in(token)
    @in = true if @found
    html << token
  end

  def convert_article(token)
    if @in
      # Hold the word back; it is emitted as part of the first link's text.
      @article = true
    else
      html << token
    end
  end
end

builder = HTMLBuilder.new
reader = TextReader.new(input, builder)
reader.parse
puts "output:"
puts builder.html
=>
output:
as found in <a href="article_6">article 6</a> or <a href="article_7">7</a>, depending on
as found in <a href="article_6">article 6</a>, <a href="article_7_bis">7 bis</a> or <a href="article_8">8</a>, depending on
as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8_bis">8 bis</a>, depending on
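The nested builder mentioned at the top of this answer is not implemented in the code above. A rough sketch of what such a NumberBuilder might look like follows; the method names (`accept`, `label`, `slug`) and the `SUFFIXES` list are hypothetical, chosen only to illustrate the idea.

```ruby
# Hypothetical nested builder: collects the tokens that make up one article
# number (digits plus optional suffixes) and renders it on request.
class NumberBuilder
  SUFFIXES = %w[bis].freeze # assumed list; extend as the documents require

  def initialize
    @parts = []
  end

  # Consumes the token if it continues the number; returns false when the
  # token marks the end of the number definition.
  def accept(token)
    word = token.strip
    fits = @parts.empty? ? word.match?(/\A\d+\z/) : SUFFIXES.include?(word)
    @parts << word if fits
    fits
  end

  # Link text, e.g. "8 bis".
  def label
    @parts.join(' ')
  end

  # Relative URL, e.g. "article_8_bis".
  def slug
    "article_#{@parts.join('_')}"
  end
end

number = NumberBuilder.new
[' 8', ' bis', ','].each { |token| break unless number.accept(token) }
puts %(<a href="#{number.slug}">#{number.label}</a>)
# prints: <a href="article_8_bis">8 bis</a>
```

The outer builder would then only need to know when a number starts and when `accept` first returns false, leaving all suffix rules in one place.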