capturing multiple instances of a pattern

Time:12-17

I have a string:

{value1} {value2}-{value3}*{value...n}

Using a regular expression, I want to capture each of the bracketed values as well as the operators between them. I do not know in advance how many bracketed values there will be.

I tried:

/(\{.*\}).*([\ |\-|\*|\/])*/mgU

but that is just getting me the values and not the operators. Where did I go wrong?

CodePudding user response:

You can validate the string first with

/\A ({ [^{}]* }) (?: [\/ *-] (?1))* \z/x

Details:

  • \A - start of string
  • ({[^{}]*}) - Group 1: a {, any zero or more chars other than { and } and then a } char
  • (?:[\/ *-](?1))* - zero or more occurrences of a /, space, * or - char followed by the Group 1 pattern
  • \z - end of string.
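To see what the validation pattern accepts and rejects, here is a minimal sketch. The helper name is_valid_expr and the sample strings are made up for illustration; the braces are escaped only to keep newer Perl versions quiet.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the whole string is a chain of
# {value} items joined by exactly one operator character each.
sub is_valid_expr {
    my ($text) = @_;
    return $text =~ /\A (\{ [^{}]* \}) (?: [\/ *-] (?1))* \z/x ? 1 : 0;
}

# Illustrative samples: a doubled operator, a dangling operator and a
# value without braces should all be rejected.
for my $sample ("{a} {b}-{c}", "{a}", "{a}--{b}", "{a}-", "a-{b}") {
    printf "%-14s %s\n", "'$sample'", is_valid_expr($sample) ? "valid" : "invalid";
}
```

Note that under /x the space inside the bracketed class is still a literal space; only whitespace outside the class is ignored.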

Then, you may collect individual matches with

/ { [^{}]* } | [\/ *-] /gx

This regex matches every substring between { and } (with {[^{}]*}) and every /, space, * or - char (with [\/ *-]).

See a complete demo script:

#!/usr/bin/perl
use strict;
use warnings;
 
my $text = "{value1} {value2}-{value3}*{value...n}";
 
if ($text =~ /\A ({ [^{}]* }) (?: [\/ *-] (?1))* \z/x) {
    while($text =~ / { [^{}]* } | [\/ *-] /gx) {
        print "$&\n";
    }
}

Output:

{value1}
 
{value2}
-
{value3}
*
{value...n}
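If the tokens are needed as a list rather than printed one by one, a global match in list context returns them all at once. A minimal sketch, assuming the string has already passed the validation step:

```perl
use strict;
use warnings;

my $text = "{value1} {value2}-{value3}*{value...n}";

# With /g in list context and no capture groups, the match returns
# every matched substring in order of appearance.
my @tokens = $text =~ / \{ [^{}]* \} | [\/ *-] /gx;

print scalar(@tokens), " tokens\n";
print "[$_]\n" for @tokens;
```

The brackets in the output are only there to make the single-space operator visible.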

CodePudding user response:

Another idea might be using the \G anchor and 2 capture groups, where the curly values are in group 1 and the operator in group 2:

\G(?=.*{[^{}]*}\z)({[^{}]*})([ *\/-])?

The pattern matches

  • \G Assert the position at the end of the previous match, or at the start of the string (in this case)
  • (?=.*{[^{}]*}\z) Positive lookahead, assert that the string ends with a curly part
  • ({[^{}]*}) Capture the value, including its curly braces, in group 1
  • ([ *\/-])? Optionally capture an operator in group 2


Example

my $str = "{value1} {value2}-{value3}*{value...n}";
while ($str =~ /\G(?=.*\{[^{}]*}\z)({[^{}]*})([ *\/-])?/g) {
    my $op = $2 // '';   # group 2 is undefined after the last value
    print "Curly value: $1 Operator: $op\n";
}

Output

Curly value: {value1} Operator:  
Curly value: {value2} Operator: -
Curly value: {value3} Operator: *
Curly value: {value...n} Operator:
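One thing to keep in mind with the \G loop is that it simply stops at the first character it cannot match, so malformed input in the middle of the string goes unnoticed. The sketch below is one possible way to detect that, by adding /c and comparing pos() with the string length afterwards. The helper name parse_chain and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [value, operator] pairs, or an
# empty list when the \G chain stops before the end of the string.
sub parse_chain {
    my ($str) = @_;
    my @pairs;
    # /c keeps pos() at the end of the last successful match when the
    # next attempt fails, instead of resetting it.
    while ($str =~ /\G(?=.*\{[^{}]*\}\z)(\{[^{}]*\})([ *\/-])?/gc) {
        push @pairs, [ $1, $2 // '' ];
    }
    return (pos($str) // 0) == length($str) ? @pairs : ();
}

for my $sample ("{a}-{b}", "{a}x{b}", "{a}-") {
    my @pairs = parse_chain($sample);
    print "'$sample': ", scalar(@pairs), " pair(s)\n";
}
```

Only the first sample is consumed completely; the second stops at the x and the third fails the lookahead because it does not end with a curly part.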

CodePudding user response:

The tokenizer approach:

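my $str = "{value1} {value2}-{value3}*{value...n}";

my @tokens;
for ($str) {
   while (1) {
      # A bracketed value: capture the text between the braces.
      /\G \{ ( [^{}]* ) \} /xgc
         and do { push @tokens, [ VALUE => $1 ]; next; };

      # An operator. The hyphen goes last in the class so that it is
      # literal ("[ -*\/]" would be a range from space to "*"). The space
      # is an operator here, so whitespace is not skipped beforehand.
      /\G ( [ *\/-] ) /xgc
         and do { push @tokens, [ OP => $1 ]; next; };

      # End of input (a trailing newline is tolerated).
      /\G \Z /xgc
         and last;

      die( "Unexpected character at pos ".( pos )."\n" );
   }
}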

It might be overkill, but it's easier to extend.
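As an illustration of what "easier to extend" could look like, here is a sketch that adds two token types for parentheses. The LPAREN and RPAREN names, the tokenize wrapper and the sample input are assumptions made for this example, not part of the question. The space is treated as an operator, and the hyphen sits last in the class so that it stays literal.

```perl
use strict;
use warnings;

# Hypothetical extension of the tokenizer: parentheses get their own
# token types. Each rule is tried at the current position (\G); /c keeps
# that position when a rule fails so the next rule can try.
sub tokenize {
    my ($str) = @_;
    my @tokens;
    for ($str) {
        while (1) {
            /\G \{ ( [^{}]* ) \} /xgc and do { push @tokens, [ VALUE  => $1  ]; next; };
            /\G ( [ *\/-] )      /xgc and do { push @tokens, [ OP     => $1  ]; next; };
            /\G \(               /xgc and do { push @tokens, [ LPAREN => '(' ]; next; };
            /\G \)               /xgc and do { push @tokens, [ RPAREN => ')' ]; next; };
            /\G \Z               /xgc and last;
            die "Unexpected character at pos " . pos() . "\n";
        }
    }
    return @tokens;
}

print join(' ', map { "$_->[0]($_->[1])" } tokenize("({a}-{b})*{c}")), "\n";
```

Adding a new kind of token is one more line in the loop, which is harder to do cleanly with a single large regex.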

CodePudding user response:

If you only have non-nested blocks, separated by a known list of operators, you can use split to very easily separate a statement into values and operators.

use strict;
use warnings;
use Data::Dumper;

my @val = split m#([- /*])#, <DATA>;   # parens will prevent operators from being consumed
print Dumper \@val;

__DATA__
{value1} {value2}-{value3}*{valuen}/{value4} {value5}-{value6}*{valuen} {value7} {value8}-{value9}

This will print:

$VAR1 = [
          '{value1}',
          ' ',
          '{value2}',
          '-',
          '{value3}',
          '*',
          '{valuen}',
          '/',
          '{value4}',
          ' ',
          '{value5}',
          '-',
          '{value6}',
          '*',
          '{valuen}',
          ' ',
          '{value7}',
          ' ',
          '{value8}',
          '-',
          '{value9}
'
        ];

From there, it should be a simple task to validate and clean up the values, as well as identify the operators.
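One possible shape for that validation and clean-up step is sketched below. It relies on the fact that split with a capturing pattern alternates values (even indexes) and operators (odd indexes). The variable names and the shortened input line are chosen for illustration.

```perl
use strict;
use warnings;

my $line = "{value1} {value2}-{value3}*{valuen}/{value4}\n";
chomp $line;                          # drop the trailing newline first

my @parts = split m#([- /*])#, $line;

# A well-formed chain has an odd number of parts: value, op, value, ...
die "Empty input or dangling operator\n" unless @parts % 2;

my (@values, @ops);
for my $i (0 .. $#parts) {
    if ($i % 2) {
        push @ops, $parts[$i];        # odd index: an operator
    }
    elsif ($parts[$i] =~ /\A \{ ( [^{}]* ) \} \z/x) {
        push @values, $1;             # keep only the text inside the braces
    }
    else {
        die "Not a bracketed value: '$parts[$i]'\n";
    }
}

print "values: @values\n";
print "ops: ", join(',', map { "'$_'" } @ops), "\n";
```

A value that itself contains an operator character, such as {a-b}, would be cut in two by split and rejected here, so this only suits input where the braces never enclose an operator.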
