| String | Regex | Result | Row |
|---|---|---|---|
| ccaaabb | `[^a]+` | cc | 1 |
| aaabb | `[^a]+` | bb | 2 |
| AbbbAcc | `[^A]` | b | 3 |
| AbbbAcc | `[^A]?` | (empty) | 4 |
| AbbbAcc | `[^A]*` | (empty) | 5 |
| AbbbAcc | `([^A]+){1}` | bbb | 6 |
| AbbbAcc | `([^A]+){2}` | b | 7 |
Row 1: From left to right, cc is matched and output, but bb is not. Why?
Row 2: The engine seems to skip every a, then reaches bb, which is matched and output. In row 1, the a's are not skipped and the b's are not reached. Why the difference?
Rows 3-5: Why do the results differ?
Rows 6-7: Why do the results differ?
Please explain, character by character and space by space, the working process of the engine.
The results were obtained by testing the strings and regexes on https://onlinetexttools.com/extract-regex-matches-from-text
CodePudding user response:
1. A regular expression matches a contiguous substring. So it starts matching at the first `c` and stops when it gets to `a`, and the match is just `cc`. `bb` is not included because it's separated from `cc`.
2. It skips over the `a` characters because they don't match, and matches `bb` after them.
3. It matches the first character that isn't `A`, so it matches the first `b`.
4. Since you've made the pattern optional with `?`, it will match an empty string. It matches the empty string at the beginning of the string, and returns an empty match.
5. `*` means zero or more matches of the pattern. There are zero matches at the very beginning, so it returns that empty string.
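As a rough check of points 1 to 5, here is a minimal sketch using Python's `re` module. It assumes Python's engine behaves like the one behind the online tool for these simple patterns, and that rows 1 and 2 used `[^a]+` (a bare `[^a]` could not produce a two-character match). The helper name `first_match` is made up for illustration.

```python
import re

# (string, pattern) pairs for rows 1-5 of the question's table.
# The + in rows 1-2 is an assumption, inferred from the reported results.
cases = [
    ("ccaaabb", r"[^a]+"),
    ("aaabb", r"[^a]+"),
    ("AbbbAcc", r"[^A]"),
    ("AbbbAcc", r"[^A]?"),
    ("AbbbAcc", r"[^A]*"),
]

def first_match(text, pattern):
    """Return the text of the leftmost match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

for row, (text, pattern) in enumerate(cases, start=1):
    # repr() makes the empty matches of rows 4 and 5 visible as ''
    print(row, pattern, repr(first_match(text, pattern)))
```

Rows 4 and 5 print `''` because the optional pattern succeeds immediately at position 0, in front of the `A`, without consuming anything.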
6-7. You seem to be printing what the capture group matched, not the entire regexp. When you quantify a capture group, it captures the last repetition.
In 6, the capture group only has to match once, so it can capture everything that `[^A]+` matches, which is `bbb`.
But in 7, the capture group has to match twice. If the first repetition matched `bbb`, there would be nothing left for the second repetition to match, so it backtracks. Now the first repetition matches `bb` and the second matches `b`, and the latter is the value of the capture group.
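The difference between the whole match and the capture group can be seen in a short sketch, again in Python's `re` module as a stand-in for the online tool's engine (an assumption). The helper `whole_and_group` is a hypothetical name used only here.

```python
import re

def whole_and_group(pattern, text):
    """Return (entire match, contents of capture group 1) for the first match."""
    m = re.search(pattern, text)
    return m.group(0), m.group(1)

text = "AbbbAcc"

# One repetition: the group grabs everything [^A]+ can match.
print(whole_and_group(r"([^A]+){1}", text))  # ('bbb', 'bbb')

# Two repetitions: the first one backtracks to 'bb' so the second can take 'b'.
# The group keeps only the last repetition.
print(whole_and_group(r"([^A]+){2}", text))  # ('bbb', 'b')
```

In both cases the entire match is `bbb`; only the value left in the group differs.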
CodePudding user response:
1. `ccbb` is not a substring of `ccaaabb`, so it can't possibly be matched.
2. It fails to match starting at position 0, so it tries starting at subsequent positions until it finally finds a match at position 3.
3. It fails to match starting at position 0, so it tries starting at subsequent positions until it finally finds a match at position 1.
4. It finds a matching substring at position 0. It doesn't need to look further.
5. It finds a matching substring at position 0. It doesn't need to look further.
6. It fails to match starting at position 0, so it tries starting at subsequent positions until it finally finds a match at position 1.
7. It fails to match starting at position 0, so it tries starting at subsequent positions until it finally finds a match at position 1. It actually matches `bbb`, not `b` as you claim.
   In the first attempt, `[^A]+` first matches `bbb`. But then `[^A]+` can't match a second time. So it tries matching less.
   In the second attempt, `[^A]+` first matches `bb`, then `b`. The last match is what's captured.
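The "try position 0, then 1, then 2, ..." behaviour described above can be imitated with a small loop. This is only a simplified sketch: real engines optimize the scan, and the function name `find_first` is invented for illustration. It relies on `Pattern.match(string, pos)` in Python's `re` module, which attempts a match only at the given position.

```python
import re

def find_first(pattern, text):
    """Try each start position in turn; return (start, matched_text) or None."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        m = compiled.match(text, start)  # anchored attempt at `start` only
        if m:
            return start, m.group(0)
    return None

print(find_first(r"[^a]+", "ccaaabb"))       # (0, 'cc')
print(find_first(r"[^a]+", "aaabb"))         # (3, 'bb')
print(find_first(r"[^A]", "AbbbAcc"))        # (1, 'b')
print(find_first(r"[^A]?", "AbbbAcc"))       # (0, '')
print(find_first(r"[^A]*", "AbbbAcc"))       # (0, '')
print(find_first(r"([^A]+){1}", "AbbbAcc"))  # (1, 'bbb')
print(find_first(r"([^A]+){2}", "AbbbAcc"))  # (1, 'bbb')
```

The loop stops at the first position where any match succeeds, which is why an empty match at position 0 wins over a longer match further to the right.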