Regex to get each item in an xpath-like string, with each subscript as a group

Time:10-23

I'd like to take an xpath-like string such as:

a.b.c[2].d[123].e1[4].f88[5]

And have each path-part as a match, with each subscript ("array index") as a group, like this:

match 1: a
match 2: b
match 3: c, group 1: 2
match 4: d, group 1: 123
match 5: e1, group 1: 4
match 6: f88, group 1: 5

I tried with the following (which doesn't work):

[^.]+(?:\[)*([0-9]+)*(?:\])*

As I understand this Regex, it means:

  1. First, match all characters except for a dot
  2. Then, check (but don't capture) for a left square bracket - it may be present 0 to unlimited times.
  3. Then, check for any number, with length 1 to unlimited - and capture as a group.
  4. Then, do 2 again for a right square bracket.

But it doesn't work.

How can I make it work?

CodePudding user response:

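[^.]+(?:\[)*([0-9]+)*(?:\])*

"But it doesn't work" because [^.]+ is greedy and consumes every character up to the next dot, including the brackets and the digits between them, so nothing is left for the capture group. Furthermore, each part of the subscript is made optional on its own, whereas the subscript should be optional as a whole.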
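The greedy behaviour can be checked quickly in Python. This is only a sketch, and it assumes the attempted pattern was written with `+` quantifiers (`[^.]+` and `[0-9]+`), as the asker's own description of "length 1 to unlimited" suggests. The helper name `show_matches` is made up for illustration.

```python
import re

# Assumed form of the asker's attempt, with the `+` quantifiers in place.
ATTEMPT = re.compile(r"[^.]+(?:\[)*([0-9]+)*(?:\])*")


def show_matches(path):
    """Return (whole match, group 1) for every match of the attempted pattern."""
    return [(m.group(0), m.group(1)) for m in ATTEMPT.finditer(path)]


for whole, index in show_matches("a.b.c[2].d[123].e1[4].f88[5]"):
    # [^.]+ has already swallowed "[2]", "[123]", ... so group 1 never participates.
    print(whole, index)
```

Every subscript ends up inside the whole match (`c[2]`, `d[123]`, and so on) and group 1 is `None` each time, which is why the name part has to exclude `[` as well as the dot.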

Applying those criteria, this expression does work:

([^.\[]+)(?:\[(\d+)\])?

Regex101 Test
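As a rough illustration of how that expression could be used, here is a minimal Python sketch. It assumes the pattern `([^.\[]+)(?:\[(\d+)\])?` and uses a hypothetical helper `split_path` that returns each name together with its subscript, or `None` when there is no subscript.

```python
import re

# Group 1: the name (anything up to a dot or an opening bracket).
# Group 2: the digits of an optional [subscript].
PATH_PART = re.compile(r"([^.\[]+)(?:\[(\d+)\])?")


def split_path(path):
    """Return (name, index) pairs; index is None when the part has no subscript."""
    return [
        (m.group(1), int(m.group(2)) if m.group(2) is not None else None)
        for m in PATH_PART.finditer(path)
    ]


print(split_path("a.b.c[2].d[123].e1[4].f88[5]"))
# [('a', None), ('b', None), ('c', 2), ('d', 123), ('e1', 4), ('f88', 5)]
```

Note that the dots are never matched at all: `finditer` simply skips over them while searching for the next match, so this version does not check that the string as a whole is well formed.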

CodePudding user response:

The pattern that you tried:

  • It matches too much, as the negated character class [^.]+ matches any char except a dot 1 or more times, so it can also match the square brackets and the digits between them.

  • Note that the notation (?:\[)* is the same as \[* and matches an opening square bracket 0 or more times.

If the \G anchor is supported, and you want to match the example string only from the start of the string, you might use 2 capture groups for the data that you want, and match the dots and square brackets in between.

\G([^\][.\s]+)(?:\[(\d+)\])?\.?

The pattern matches:

  • \G Assert the position at the end of the previous match, or at the start of the string
  • ([^\][.\s]+) Capture group 1, match 1 or more chars other than ] [ . or a whitespace char (as there do not seem to be any spaces in the example string)
  • (?:\[(\d+)\])? Optionally match 1 or more digits in capture group 2 between square brackets
  • \.? Match an optional dot to continue the consecutive matching for the \G anchor

Regex demo
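Python's built-in `re` module does not support `\G`. A comparable effect can be sketched by calling the compiled pattern's `match(string, pos)` in a loop, which only matches exactly at `pos`, so each match has to start where the previous one ended. The function name `consecutive_parts` is hypothetical, and the `[` inside the character class is escaped here purely for readability (it is equivalent to the class in the answer).

```python
import re

# The answer's pattern without \G; the anchoring comes from Pattern.match(path, pos).
PART = re.compile(r"([^\]\[.\s]+)(?:\[(\d+)\])?\.?")


def consecutive_parts(path):
    """Collect (name, index) pairs from the start, stopping at the first gap."""
    parts = []
    pos = 0
    while pos < len(path):
        m = PART.match(path, pos)  # must match exactly at pos, like \G
        if m is None:
            break  # the chain of consecutive matches is broken
        name, index = m.groups()
        parts.append((name, int(index) if index is not None else None))
        pos = m.end()  # the next match has to start here
    return parts


print(consecutive_parts("a.b.c[2].d[123].e1[4].f88[5]"))
# [('a', None), ('b', None), ('c', 2), ('d', 123), ('e1', 4), ('f88', 5)]
print(consecutive_parts("a.b c.d"))
# [('a', None), ('b', None)]
```

The second call shows the difference from a plain `finditer`: matching stops at the space instead of resuming at `c`.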

If there cannot be a dot at the end of the string, and there must be at least 1 dot present, you can assert the whole format first from the start of the string:

(?:^(?=[^.]+(?:\.[^.]+)+$)|\G(?!^))\.?([^\][.]+)(?:\[(\d+)\])?

Regex demo
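In an engine without `\G`, the same idea can be approximated in two steps: validate the overall shape first, then extract the parts with anchored matches. The Python sketch below assumes the format check is `[^.]+(?:\.[^.]+)+` (non-empty parts, at least one dot, no trailing dot); `parse_strict` is a hypothetical name, and `fullmatch` is slightly stricter than `$` because it does not tolerate a trailing newline.

```python
import re

# Overall shape: non-empty parts separated by single dots, at least one dot.
VALID = re.compile(r"[^.]+(?:\.[^.]+)+")
# One path part, optionally preceded by the separating dot.
PART = re.compile(r"\.?([^\]\[.]+)(?:\[(\d+)\])?")


def parse_strict(path):
    """Return (name, index) pairs, or an empty list if the format check fails."""
    if VALID.fullmatch(path) is None:
        return []
    parts = []
    pos = 0
    while pos < len(path):
        m = PART.match(path, pos)  # anchored at pos, standing in for \G
        if m is None:
            break
        name, index = m.groups()
        parts.append((name, int(index) if index is not None else None))
        pos = m.end()
    return parts


print(parse_strict("a.b[3]"))  # [('a', None), ('b', 3)]
print(parse_strict("a.b."))    # []
```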
