I am attempting to use regex to match a pattern where some X (any character) occurs exactly n times in succession. I know a little about regex, but don't know of anything like this.
My previous attempts left me using (.) as a capture group for my X, but I wasn't able to find a way to make sure it occurred exactly n times (no more, and no less).
(Edit) For more context, I am trying to separate strings (containing only the letters 'r', 'p', and 's') into either "human" or "machine" generated, and I want to classify a string based on whether it contains "XrrrrX" (where X is either s or p), "YssssY" (where Y is either r or p), or "ZppppZ" (where Z is either s or r).
Some sample strings are:
psrsrprrsssrrrpsprprsppspsssrsrssrpprppsrpssrp
psrpsprpsrpprpsprpsprpsrpprppsrpsprsprsprppsrp
psrrrrsprsrpsrrsprrrrrprpssssrsprrpspspppprpsr
rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
where I want to match only strings that have at most 5 of any character in a row and also at least one occurrence of xxxxx (where x is any character repeated 5 times in a row).
CodePudding user response:
You need to use a back-reference to your capture group. Here is an example regex:
(.)\1{2}
Regex explained:
(.) is a capture group that captures literally anything a single time.
\1 is a back-reference to the group you just captured (that single character).
{2} is a quantifier, which matches the previous token (the \1) exactly twice.
Note that, to capture a single character n times, you have to specify {n - 1} as the quantifier because the first match was already captured by (.).
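As a minimal sketch of the back-reference idea in Python's built-in re module: the helper names below are my own, not part of the answer. Note that (.)\1{2} on its own also matches inside a longer run, so the second helper (an assumption about what "exactly n" should mean) measures each whole run with (.)\1* and compares its length in code.

```python
import re

# The pattern from the answer: one captured character followed by two
# more copies of it, i.e. a run of at least three identical characters.
TRIPLE = re.compile(r"(.)\1{2}")


def has_run_of_at_least(text, n):
    """Return True if some character occurs n or more times in a row (n >= 1)."""
    return re.search(r"(.)\1{%d}" % (n - 1), text) is not None


def has_run_of_exactly(text, n):
    """Return True if some maximal run of one character has length exactly n."""
    # (.)\1* greedily consumes one whole run per match, so every match
    # returned by finditer is a maximal run.
    return any(len(m.group(0)) == n for m in re.finditer(r"(.)\1*", text))


print(bool(TRIPLE.search("abbbc")))      # True: "bbb"
print(bool(TRIPLE.search("abbbbc")))     # True: "bbb" inside "bbbb"
print(has_run_of_exactly("abbbbc", 3))   # False: the only long run has length 4
```

Checking the run length in code keeps the regex short, at the cost of no longer being a single pattern.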
CodePudding user response:
I believe the following should do what you're after:
^(?!([psr]*?([psr])\2{4})\2)(?1)(?2)*$
See an online demo
^ - Start-line anchor;
(?! - Open a negative lookahead;
([psr]*?([psr])\2{4})\2) - A 1st capture group that matches 0+ (lazy) characters of [psr] up to a nested 2nd capture group holding any one of those characters. Right after it is a backreference to the 2nd capture group, which we match 4 times. After that we match the content of the 2nd capture group once more, so that a run of 6 is rejected, before we close the negative lookahead;
(?1) - Match the same subpattern as the 1st group;
(?2)* - Match the same subpattern as the 2nd group 0+ (greedy) times up to;
$ - End-line anchor.
I suppose this would be short for something like:
^(?![psr]*?([psr])\1{5})[psr]*?([psr])\2{4}[psr]*$
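Python's built-in re module does not support the (?1) subroutine syntax, so the sketch below uses the expanded pattern instead. The function name is hypothetical, and the sample strings are copied from the question.

```python
import re

# Expanded form of the pattern: reject any run of 6 or more, then require
# at least one run of 5 somewhere in a string made only of p, s and r.
PATTERN = re.compile(r"^(?![psr]*?([psr])\1{5})[psr]*?([psr])\2{4}[psr]*$")


def has_five_but_not_six(text):
    """Return True if text has a run of 5 and no run of 6 or more."""
    return PATTERN.match(text) is not None


samples = [
    "psrsrprrsssrrrpsprprsppspsssrsrssrpprppsrpssrp",
    "psrpsprpsrpprpsprpsprpsrpprppsrpsprsprsprppsrp",
    "psrrrrsprsrpsrrsprrrrrprpssssrsprrpspspppprpsr",
    "rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr",
]

for sample in samples:
    print(has_five_but_not_six(sample), sample)
```

By my reading of the samples, only the third string contains a run of exactly five (rrrrr) and nothing longer, so it should be the only one reported as a match.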
CodePudding user response:
I have assumed you wish to match any of the following strings:
'psssssp'
'sppppps'
'rsssssr'
'srrrrrs'
'prrrrrp'
'rpppppr'
(but I have a sneaking suspicion that you actually want to do something else, in which case please tell me in a comment).
That could be done by matching a simple alternation:
(?:psssssp|sppppps|rsssssr|srrrrrs|prrrrrp|rpppppr)
Alternatively, you could use a regular expression that employs back-references:
([psr])(?!\1)([psr])\2{4}\1
Demo
The second regular expression has the following components.
([psr]) # match 'p', 's' or 'r' and save to capture group 1
(?!\1) # the next character cannot be the content of capture group 1
([psr]) # match 'p', 's' or 'r' and save to capture group 2
\2{4} # match the content of capture group 2 four times
\1 # match the content of capture group 1
(?!\1) is a negative lookahead.
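A quick way to see that the two expressions behave alike is to run both over the six target strings and a few near misses. This is only a sketch in Python's re module; the variable and function names are mine.

```python
import re

# The two patterns from this answer, unchanged.
ALTERNATION = re.compile(r"(?:psssssp|sppppps|rsssssr|srrrrrs|prrrrrp|rpppppr)")
BACKREF = re.compile(r"([psr])(?!\1)([psr])\2{4}\1")

targets = ["psssssp", "sppppps", "rsssssr", "srrrrrs", "prrrrrp", "rpppppr"]
near_misses = [
    "pssssp",    # run of only 4
    "pssssssp",  # run of 6
    "psssssr",   # different characters on each side
    "ppppppp",   # same character throughout
]


def both_match(text):
    """Return the search results of both patterns as a pair of booleans."""
    return (ALTERNATION.search(text) is not None,
            BACKREF.search(text) is not None)


for text in targets + near_misses:
    print(text, both_match(text))
```

Both patterns use search semantics here, so they also find these seven-character windows inside longer strings.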