I'm getting a list of inputs from the user of supposely valid perl regexp values. Examples could be:
\b[Bb]anana\b
\s*Apples[BANANA]\s
Is there a safe way to validate these strings?
CodePudding user response:
First, consider how much you want to let users do with a pattern. A Perl regex can run arbitrary code.
But, to validate that you can use a string as a pattern without it causing a fatal error, you can use the qr//
operator to compile the string and return the regex. If there's a problem, the qr
gives you a fatal error that you can catch with eval
:
my $pattern = eval { qr/$input/ };
If you get back undef
, the pattern was not valid. And, despite the comments in the question, there are infinite ways to make invalid patterns. I know because I type them in by hand all the time and I haven't run out of ways to mess up :)
This does not apply the pattern to a string, but you can use $pattern
to make the match:
if( $pattern ) {
$target =~ $pattern; # or $target =~ m/$pattern/
}
CodePudding user response:
Well, validating a regular expression requires knowledge of the kind of inputs you are expecting. There's a direct relationship between the regexp operators and the sets of strings that are accepted by the automaton.
The problem here is that normally the set of strings is not well known or is badly specified, for example:
The essential set of operators in regex are the basic language character set (which provides the symbols to operate on) and the operators that make things complex: this are the alternative |
(select one alternative or the other), concatenation (there's no symbol for this, as two regexps are just put together to mean the set of strings comming from one set, and followed by a string, this time coming from the second set) and closure, symbolized by a *
(this last meaning allowing any repetition ---including none--- of strings comming from the previous set).
Absolutely all regular expressions can be handled as a (mostly, more complicated) expression that employs only these three operators and no more. For example, the
operator can be handled by repeating the regexp it is applied to, and adding the *
to the second instance (surrounding by parenthesis to group it all) The ?
optional suffix can be handled by the follwing rule (regexp)? == (regexp|)
(the alternative of using it or not)
- The
|
means an alternative... you provide two sets of strings and the result set is the union of boths sets. This means that a string will be accepted if it belongs to one or any of the sets. - The catenation means the set is constructing by evaluating the cartesian product of both sets. You need to make pairs of strings, taking one from the first set and the second from the other set, and all the possible sets are formed this way.
- The closure implies building a string as a sequence of (possibly zero) instances from the same set.... this means that you can concatenate any instances from the set is it applied.
This set of rules will give you the complete set of strings that make your regular expression. This can coincide (or not) to which you have in mind... but if what you have in mind is poorly defined, so it will be your regular expression.
So, as a conclussion, you are asking for a general procedure to test your own mind and how do you design your regular expressions. There's a theorem (called the Pumping theorem) that is used in the demonstration of the equivalence of regular expressions and finite state automata. This is a very important achievement, because it allows you to use regular expressions for efficient single pass, string recognition. If you dig on this, you will find that is possible to write a tool that, from a regular expression can build systematically the full set of strings that will be accepted by some regexp. This has a problem although, is that many of those regexp create infinite sets of strings, and this means the algorithm is not going to finish in a finite time.
As a final comment I can tell you that this makes regular expressions a very powerfull tool to select a string. You can detect with regular expressions complex things like being a string of digits that make a number that is multiple of 23 in it's decimal form, for example, or to validate the number of a credit card for transcription errors.