Complex text substitution algorithm or design pattern

Time:09-17

I need to perform multiple substitutions on text coming from a database before displaying it to the user.

My example uses data most likely found in a CRM, and the output is HTML for the web, but the question generalizes to any other text-substitution need and to any programming language. In my case I use PHP, but it's more an algorithm question than a PHP question.

Problem

Each of the 3 examples below is super easy to do with regular expressions. But combining them in a single shot is not so straightforward, even with multi-step substitutions: they interfere with each other.

Question

Is there a design pattern for doing multiple interfering text substitutions?

Example #1 of substitution: The IDs.

We work with IDs. The IDs are sha-1 digests. IDs are universal and can represent any entity in the company, from a user to an airport, from an invoice to a car.

So in the database we can find this text to be displayed to a user:

User d19210ac35dfc63bdaa2e495e17abe5fc9535f02 paid 50 EUR
in the payment 377b03b0b4e92502737eca2345e5bdadb1262230. We sent
an email a49c6737f80eadea0eb16f4c8e148f1c82e05c10 to confirm.

We want all IDs to be translated into links so the user viewing the info can click them. There's one general URL for decoding IDs. Let's assume it's http://example.com/id/xxx

The transformed text would be this:

User <a href="http://example.com/id/d19210ac35dfc63bdaa2e495e17abe5fc9535f02">d19210ac35dfc63bdaa2e495e17abe5fc9535f02</a> paid 50 EUR
in the payment <a href="http://example.com/id/377b03b0b4e92502737eca2345e5bdadb1262230">377b03b0b4e92502737eca2345e5bdadb1262230</a>. We sent
an email <a href="http://example.com/id/a49c6737f80eadea0eb16f4c8e148f1c82e05c10">a49c6737f80eadea0eb16f4c8e148f1c82e05c10</a> to confirm
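A minimal sketch of rule #1 on its own, written in Python purely for illustration (the question is language-agnostic). It assumes an ID is exactly 40 lowercase hex digits standing alone as a word; the names SHA1_RE, ID_BASE_URL and link_ids are hypothetical.

```python
import re

# Assumption: an ID is exactly 40 lowercase hex digits delimited by word boundaries.
SHA1_RE = re.compile(r"\b[0-9a-f]{40}\b")

# Decoder URL taken from the example in the question.
ID_BASE_URL = "http://example.com/id/"


def link_ids(text: str) -> str:
    """Wrap every SHA-1-looking token in a link to the ID decoder."""
    def to_link(match: re.Match) -> str:
        digest = match.group(0)
        return f'<a href="{ID_BASE_URL}{digest}">{digest}</a>'

    return SHA1_RE.sub(to_link, text)
```

Note that the word boundary also matches right after a slash, so an ID embedded in a URL is matched too. That is harmless here, but it matters once the rules are combined.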

Example #2 of substitution: The Links

We want anything that resembles a URI to be clickable. Let's focus only on the http and https protocols and forget the rest.

If we find this in the database:

Our website is http://mary.example.com and the info
you are requesting is in this page http://mary.example.com/info.php

would be converted into this:

Our website is <a href="http://mary.example.com">http://mary.example.com</a> and the info
you are requesting is in this page <a href="http://mary.example.com/info.php">http://mary.example.com/info.php</a>
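Rule #2 in isolation could be sketched like this. The URL pattern is a deliberately crude assumption (http or https followed by any run of characters other than whitespace, quotes and angle brackets), and URL_RE and link_urls are hypothetical names. Real URL detection is fuzzier; for instance, this pattern would swallow a trailing full stop.

```python
import re

# Assumption: a URL runs until whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")


def link_urls(text: str) -> str:
    """Turn every http/https URL into an anchor pointing at itself."""
    def to_link(match: re.Match) -> str:
        url = match.group(0)
        return f'<a href="{url}">{url}</a>'

    return URL_RE.sub(to_link, text)
```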

Example #3 of substitution: The HTML

When the original text contains HTML, it must not be sent raw, because the browser would interpret it. We want to change the < and > chars into the escaped forms &lt; and &gt;. The translation table for HTML5 also requires the & symbol to be converted to &amp;. This also affects, for example, the Message-IDs of emails.

For example if we find this in the database:

We need to change the CSS for the <code> tag to a pure green.
Sent to John&Partners in Message-ID: <[email protected]> this morning.

The resulting substitution would be:

We need to change the CSS for the &lt;code&gt; tag to a pure green.
Sent to John&amp;Partners in Message-ID: &lt;[email protected]&gt; this morning.
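Rule #3 alone is a one-liner in most languages. As an illustrative sketch in Python, the standard library's html.escape with quote=False converts exactly the three characters named above (&, < and >); the wrapper name escape_html is hypothetical.

```python
import html


def escape_html(text: str) -> str:
    """Escape &, < and > so the text is displayed rather than interpreted."""
    # quote=False leaves " and ' untouched, matching the rule as described.
    return html.escape(text, quote=False)
```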

All right... But... combinations?

Up to here, every change "per se" is super-easy.

But when we combine things, we want the result to still feel "natural" to the user. Let's assume that the original text contains HTML and one of the tags is an <a> tag. We still want to see the complete tag "displayed", with the HREF clickable, and the anchor text clickable too if it is itself a link.

Combination sample: #2 (inject links) then #3 (flatten HTML)

Let's say we have this in the database:

Paste this <a class="dark" href="http://example.com/data.xml">Download</a> into your text editor.

If we first apply #2 to transform the links and then #3 to encode HTML we would have:

Applying rule #2 (inject links) to the original, the link http://example.com/data.xml is detected and substituted by <a href="http://example.com/data.xml">http://example.com/data.xml</a>

Paste this <a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>">Download</a> into your text editor.

which is obviously broken HTML and makes no sense. In addition, applying rule #3 (flatten HTML) to the output of #2, we would have:

Paste this &lt;a class="dark" href="&lt;a href="http://example.com/data.xml"&gt;http://example.com/data.xml&lt;/a&gt;"&gt;Download&lt;/a&gt; into your text editor.

which in turn is merely the flat HTML representation of the broken HTML, and not clickable. Wrong output: neither #2 nor #3 was satisfied.
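The interference can be reproduced with a short sketch that chains the two passes naively. It is Python for illustration only; the URL pattern is a crude assumption (anything up to whitespace, a quote or an angle bracket), and the function names are hypothetical.

```python
import html
import re

# Assumption: a URL runs until whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")


def link_urls(text: str) -> str:
    """Rule #2: wrap each URL in an anchor."""
    return URL_RE.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)


def links_then_escape(text: str) -> str:
    """Naive chaining: rule #2 first, then rule #3 over its output."""
    # The escaping pass cannot tell injected anchors from original markup,
    # so it flattens both and no link survives.
    return html.escape(link_urls(text), quote=False)
```

Called on the sample text above, links_then_escape returns the flattened, non-clickable string shown: the second pass destroys what the first one built.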

Reversed combination: First #3 (flatten HTML) then #2 (inject links)

If I first apply rule #3 to "escape all HTML" and afterwards apply rule #2 to "inject link HTML", this happens:

Original (same as above):

Paste this <a class="dark" href="http://example.com/data.xml">Download</a> into your text editor.

Result of applying #3 (flatten HTML)

Paste this &lt;a class="dark" href="http://example.com/data.xml"&gt;Download&lt;/a&gt; into your text editor.

Then we apply rule #2 (inject links), and it seems to work:

Paste this &lt;a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>"&gt;Download&lt;/a&gt; into your text editor.

This works because " is not a valid URL char, so the detector stops exactly at the end of http://example.com/data.xml.

But... what if the original text also had a link as the link text? This is a very common scenario. Like this original text:

Paste this <a class="dark" href="http://example.com/data.xml">http://example.com/data.xml</a> into your text editor.

Then applying #3 would give this:

Paste this &lt;a class="dark" href="http://example.com/data.xml"&gt;http://example.com/data.xml&lt;/a&gt; into your text editor.

HERE WE HAVE A PROBLEM

As all of &, ; and / are valid URL characters, the URL detector of rule #2 would take http://example.com/data.xml&lt;/a&gt; as the URL instead of stopping after .xml.

This would result in this wrong output:

Paste this &lt;a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>"&gt;<a href="http://example.com/data.xml&lt;/a&gt;">http://example.com/data.xml&lt;/a&gt;</a> into your text editor.

So http://example.com/data.xml&lt;/a&gt; got substituted by <a href="http://example.com/data.xml&lt;/a&gt;">http://example.com/data.xml&lt;/a&gt;</a> but the problem is that the URL was not correctly detected.
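The reversed order can be sketched the same way. This is Python for illustration, with the same crude URL pattern as an assumption and hypothetical function names. Once the text is escaped, the entity that used to be a < delimiter looks like ordinary URL characters, so the second URL is over-matched.

```python
import html
import re

# Assumption: a URL runs until whitespace, a quote, or an angle bracket.
# Note that & and ; are accepted, as they are legal URL characters.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")


def link_urls(text: str) -> str:
    """Rule #2: wrap each URL in an anchor."""
    return URL_RE.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)


def escape_then_links(text: str) -> str:
    """Naive chaining: rule #3 first, then rule #2 over its output."""
    # After escaping, "</a>" has become "&lt;/a&gt;", which the URL pattern
    # happily swallows as part of the preceding URL.
    return link_urls(html.escape(text, quote=False))
```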

Let's mix it up with rule #1

If rules #2 and #3 are a mess when processed together, imagine mixing them with rule #1 when a URL contains a sha-1, as in this database entry:

Paste this <a class="dark" href="http://example.com/id/89019b16ab155ba1c19e1ab9efdb9134c8f9e2b9">http://example.com/id/89019b16ab155ba1c19e1ab9efdb9134c8f9e2b9</a> into your text editor.

Could you imagine??

Tokenizer?

I have thought of creating a syntax tokenizer. But I feel it's overkill.

Is there a design pattern?

I wonder whether there is a design pattern for doing multiple text substitutions that I could read and study: what it is called and where it is documented.

If there's not any pattern... then... is building a syntax tokenizer the only solution?

I feel there must be a much simpler way to do this. Do I really have to tokenize the text into a syntax tree and then re-render it by traversing the tree?

CodePudding user response:

The design pattern is the one you already rejected, left-to-right tokenisation. Of course, that's easier to do in languages for which there are code generators which produce lexical scanners.

There's no need to parse or to build a syntax tree. A linear sequence of tokens suffices. In effect, the scanner becomes a transducer. Each token is either passed through unaltered, or is replaced immediately with the translation required.

Nor does the tokeniser need to be particularly complicated. The three regular expressions you currently have can be used, combined with a fourth token type representing any other character. The important part is that all patterns are tried at each point, one is selected, the indicated replacement is performed, and the scan resumes after the match.
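A minimal sketch of such a single-pass scanner, in Python for illustration. The three patterns are assumptions standing in for the asker's own regular expressions, and TOKEN_RE, render and transduce are hypothetical names. The patterns are combined into one alternation so that all of them are tried at each position; the substitution function leaves unmatched text untouched, which plays the role of the fourth "any other character" token.

```python
import html
import re

ID_BASE_URL = "http://example.com/id/"  # decoder URL from the question

# One alternation: at each position the first alternative that matches wins,
# and the scan resumes after the match, so replacements are never rescanned.
TOKEN_RE = re.compile(
    r"""(?P<url>https?://[^\s"'<>]+)   # rule 2: URLs (tried first)
      | (?P<id>\b[0-9a-f]{40}\b)       # rule 1: SHA-1 IDs
      | (?P<special>[<>&])             # rule 3: HTML metacharacters
    """,
    re.VERBOSE,
)


def render(match: re.Match) -> str:
    """Translate one token according to which pattern matched it."""
    raw = match.group(0)
    if match.lastgroup == "url":
        safe = html.escape(raw, quote=True)  # e.g. & in a query string
        return f'<a href="{safe}">{safe}</a>'
    if match.lastgroup == "id":
        return f'<a href="{ID_BASE_URL}{raw}">{raw}</a>'
    return html.escape(raw, quote=False)  # one of <, > or &


def transduce(text: str) -> str:
    """Single left-to-right pass; text between tokens is copied unaltered."""
    return TOKEN_RE.sub(render, text)
```

Because the URL pattern runs on the raw text, it stops at the original < rather than at an escaped entity, and a SHA-1 inside a URL is consumed as part of the URL token instead of being linked a second time. In PHP, preg_replace_callback could presumably play the same role as the substitution call here.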
