How to retrieve the captured substrings from a capturing group that may repeat?

Time:10-21

I'm sorry, I find it difficult to express this question in English, so let's go straight to a simple example.

Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".

The subject string doesn't need to contain exactly 4 items. For example, it could be "one:two:three", in which case $1 and $2 will be "one" and "three", respectively. Again, the middle item is missing.

What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ code, so Perl's built-in split function is not available. Thanks.

CodePudding user response:

If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use

/([^:]+)/g;

which captures the list of items. How to then retrieve it depends on the language used.


The /g "modifier" (and capture groups) above refer to Perl's syntax, which is how the question was originally tagged. In C++ one would use std::match_results to retrieve sub-expressions, after either regex_match or regex_search (some context in the question would help). Will update...
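As a minimal sketch of the /g idea in C++, the following uses the standard library's std::regex rather than PCRE2 (an assumption made to keep the example dependency-free). The function name split_items is hypothetical. std::sregex_iterator plays the role of /g by walking over every match of [^:]+ in turn.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every run of non-colon characters, one regex match per item.
// split_items is an illustrative name, not part of any library.
std::vector<std::string> split_items(const std::string& subject) {
    static const std::regex item_re("[^:]+");
    std::vector<std::string> items;
    // A default-constructed sregex_iterator is the end-of-sequence marker.
    for (std::sregex_iterator it(subject.begin(), subject.end(), item_re), end;
         it != end; ++it) {
        items.push_back(it->str());
    }
    return items;
}
```

With PCRE2 the same effect would presumably come from calling pcre2_match in a loop, restarting each call at the end offset of the previous match.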

CodePudding user response:

Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
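The "last iteration only" behaviour can be reproduced with a small sketch. It uses std::regex instead of PCRE2 and replaces the lazy (.*?) from the question with (\w+); both are assumptions made for a self-contained example. The helper name last_repeated_capture is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns whatever group 2 holds after a full match, or "" when the subject
// does not match or the repeated group never participated.
std::string last_repeated_capture(const std::string& subject) {
    // Group 2 sits inside a repeated non-capturing group, so each iteration
    // overwrites the value stored by the previous one.
    static const std::regex re(R"(^(\w+)(?::(\w+))*$)");
    std::smatch m;
    if (!std::regex_match(subject, m, re)) {
        return "";
    }
    return m[2].str();
}
```

For "apple:banana:cherry:durian" this returns "durian", matching what the question reports for $2.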

If the whole string can only contain word characters separated by colons:

(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+

The pattern matches:

  • (?: Non capture group
    • ^ Assert start of string
    • (?=\w+(?::\w+)+$) Assert that from the current position there are 1+ word characters followed by 1+ repetitions of : and 1+ word characters, up to the end of the string
    • | Or
    • \G(?!^): Assert the position at the end of the previous match, not at the start and match :
  • ) Close non capture group
  • \K\w+ Forget what is matched so far, and match 1+ word characters
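std::regex supports neither \G nor \K, so the pattern above cannot be used there as written. As a rough equivalent under that limitation, this sketch splits the work in two: first validate the whole string (the job of the lookahead), then iterate over the words. This is a swapped-in technique, not the PCRE2 pattern itself, and items_if_valid is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the items only when the whole subject is word:word[:word...],
// mirroring the lookahead (?=\w+(?::\w+)+$); otherwise returns an empty list.
std::vector<std::string> items_if_valid(const std::string& subject) {
    static const std::regex whole_re(R"(\w+(?::\w+)+)");
    static const std::regex word_re(R"(\w+)");
    std::vector<std::string> items;
    // regex_match requires the entire subject to match, like ^...$.
    if (!std::regex_match(subject, whole_re)) {
        return items;
    }
    for (std::sregex_iterator it(subject.begin(), subject.end(), word_re), end;
         it != end; ++it) {
        items.push_back(it->str());
    }
    return items;
}
```

Like the original pattern, this rejects a subject with a single item and no colon, because the lookahead requires at least one ":word" repetition.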


To match only consecutive words from the start of the string, while allowing other characters after the word characters:

\G:?\K\w+
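The contiguity that \G enforces can be approximated in std::regex with the match_continuous flag, which only accepts a match beginning exactly at the search start. The sketch below is an emulation under that assumption, not the PCRE2 pattern itself: a capture group stands in for \K, and leading_items is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect words from the start of the subject, each optionally preceded by
// a colon, and stop at the first position where that chain breaks.
std::vector<std::string> leading_items(const std::string& subject) {
    static const std::regex step_re(R"(:?(\w+))");
    std::vector<std::string> items;
    std::smatch m;
    auto pos = subject.cbegin();
    // match_continuous anchors each attempt at pos, much like \G.
    while (std::regex_search(pos, subject.cend(), m, step_re,
                             std::regex_constants::match_continuous)) {
        items.push_back(m[1].str());  // group 1 replaces the effect of \K
        pos = m[0].second;            // resume where this match ended
    }
    return items;
}
```

Each successful step consumes at least one character, so the loop always terminates.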
