I have a RegEx that does not match using PHP (nor regex101.com) but does with MariaDB. Its purpose is to search for HTML classes in XML values (the HTML is encoded).
Here is an example XML value where you can see a <ul>
element with a liste--non-ordonnee--gros--exergue
CSS class:
<ul class="liste--non-ordonnee--gros--exergue">
I want the RegEx to match only full classes. Therefore if I search --exergue
I don't want it to match.
Using PHP or other PCRE/PCRE2 online tester it does not match:
~(class="(?:[^&]*\s)?)--exergue~sU
But using MariaDB (v10.2.40 - PCRE 8.42), it matches:
(?sU)(class="(?:[^&]*\s)?)--exergue
It looks for a class
attribute containing the class to replace. I tried to change the class name to something else for demonstration purposes (searching --suffix
in class-with--suffix
) but then it would not match anymore on the MariaDB version.
What is wrong with my RegEx or its MariaDB version ?
I am aware that regular expressions should not be used with HTML and am open to alternatives but this is TYPO3: storing encoded HTML into XML values in a db column. Design changes require massive class renaming.
CodePudding user response:
First, the very short MCVE of your case:
SELECT 'class="s--e' REGEXP '(?sU)(class="(?:[^&]*\s)?)--e'
MariaDB matches, PHP doesn't match (demo). Why? As in MariaDB's manual for REGEXP:
Note: Because MariaDB uses the C escape syntax in strings (for example, "\n" to represent the newline character), you must double any "
\
" that you use in your REGEXP strings.
Your issue is caused by the \s
that should be double-escaped as \\s
in your MariaDB query. Once you fix that, the PHP and SQL regex statements become equivalent and behave the same.
This won't match (with --e
not led by space):
SELECT 'class="s--e' REGEXP '(?sU)(class="(?:[^&]*\\s)?)--e';
This will match (with --e
led by space):
SELECT 'class="s --e' REGEXP '(?sU)(class="(?:[^&]*\\s)?)--e';
CodePudding user response:
This matches with PHP preg
8.0.x, within 169 steps (there still might be room for improvement):
.*\sclass="(?:.*)?".*
The XML would have been relevant to match more accurately, compared to a single occurrence. This means, you've removed the sample data and therefore may have misrepresented the problem.