Regex to match X repeated exactly n times in a row

Time:02-18

I am trying to write a regex that matches some character X (any character) occurring exactly n times in succession. I know a little about regex, but I don't know of a construct for this.

In my previous attempts I used (.) as a capture group for X, but I couldn't find a way to make sure it is repeated exactly n times (no more and no fewer).

(Edit) For more context: I am trying to classify strings (containing only the letters 'r', 'p', and 's') as either "human" or "machine" generated, and I want to base that on whether a string contains "XrrrrX" (where X is either s or p), "YssssY" (where Y is either r or p), or "ZppppZ" (where Z is either s or r).

Some sample strings are:

psrsrprrsssrrrpsprprsppspsssrsrssrpprppsrpssrp
psrpsprpsrpprpsprpsprpsrpprppsrpsprsprsprppsrp
psrrrrsprsrpsrrsprrrrrprpssssrsprrpspspppprpsr
rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

I want to match only strings that have at most 5 of any character in a row and at least one occurrence of xxxxx (where x is any single character, repeated 5 times in a row).

CodePudding user response:

You need to use a back-reference to your capture group. Here is an example regex:

(.)\1{2}

Regex explained:

  • (.) is a capture group that matches any single character (by default, anything except a newline)
  • \1 is a back-reference to the group you just captured (that single character)
  • {2} is a quantifier, which matches the previous token (the \1) exactly twice.

Note that, to match a single character n times in a row, you have to specify {n - 1} as the quantifier, because the first occurrence was already consumed by (.).
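
As a rough illustration, here is how that pattern could be built for an arbitrary n in Python (the language and both function names are my own assumptions, nothing from the question). Keep in mind that (.)\1{n-1} by itself only guarantees at least n repeats, since a longer run also contains a run of n. The second helper sketches one possible way to get "exactly n": collect maximal runs with (.)\1* and keep those of the right length.

```python
import re


def has_run_of_at_least(text, n):
    """True if some character occurs at least n times in a row (n >= 2)."""
    # (.) consumes the first occurrence, \1{n-1} the remaining n - 1.
    pattern = r"(.)\1{%d}" % (n - 1)
    return re.search(pattern, text) is not None


def runs_of_exactly(text, n):
    """Return every maximal run of a repeated character whose length is exactly n."""
    # (.)\1* is greedy, so each match swallows a whole run before moving on.
    return [m.group(0) for m in re.finditer(r"(.)\1*", text) if len(m.group(0)) == n]


print(has_run_of_at_least("abbbc", 3))   # True
print(runs_of_exactly("aaabbbbccc", 3))  # ['aaa', 'ccc']
print(runs_of_exactly("aaaa", 3))        # []
```

The filtering step is done in Python rather than in the regex itself, which keeps the pattern simple at the cost of not being a single expression.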

CodePudding user response:

I believe the following should do what you're after:

^(?!([psr]*?([psr])\2{4})\2)(?1)(?2)*$


  • ^ - Start-line anchor;
  • (?! - Open a negative lookahead;
    • ([psr]*?([psr])\2{4})\2) - A 1st capture group that lazily matches zero or more characters from [psr], up to a nested 2nd capture group holding any one of those characters, followed by a back-reference to the 2nd group repeated 4 times (a run of 5 in total). After the 1st group, one more back-reference to the 2nd group extends the run to 6, so the lookahead rejects any string containing a run of 6 or more; the final parenthesis closes the negative lookahead;
  • (?1) - Match the same subpattern as the 1st group;
  • (?2)* - Match the same subpattern as the 2nd group zero or more times (greedily), up to;
  • $ - End-line anchor.

I suppose this would be short for something like:

^(?![psr]*?([psr])\1{5})[psr]*?([psr])\2{4}[psr]*$
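
If it helps, here is a small sketch that runs the expanded form against the four sample strings from the question. It uses Python's built-in re module, which, as far as I know, does not support the (?1) / (?2) subroutine syntax, so only the longer pattern is used. The names PATTERN, SAMPLES and has_max_run_of_five are made up for the example.

```python
import re

# Expanded pattern from above: reject any run of 6 or more,
# then require at least one run of 5.
PATTERN = re.compile(r"^(?![psr]*?([psr])\1{5})[psr]*?([psr])\2{4}[psr]*$")

# The four sample strings from the question.
SAMPLES = [
    "psrsrprrsssrrrpsprprsppspsssrsrssrpprppsrpssrp",
    "psrpsprpsrpprpsprpsprpsrpprppsrpsprsprsprppsrp",
    "psrrrrsprsrpsrrsprrrrrprpssssrsprrpspspppprpsr",
    "rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr",
]


def has_max_run_of_five(s):
    """True if s consists of p/s/r only and its longest run is exactly 5."""
    return PATTERN.match(s) is not None


for s in SAMPLES:
    print(has_max_run_of_five(s), s)
```

Only the third sample should come out as True: it contains "rrrrr" and nothing longer, while the first two never reach a run of 5 and the last is one long run of r.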

CodePudding user response:

I have assumed you wish to match any of the following strings:

'psssssp'
'sppppps'
'rsssssr'
'srrrrrs'
'prrrrrp'
'rpppppr'

(but I have a sneaking suspicion that you actually want to do something else, in which case please tell me in a comment).

That could be done by matching a simple alternation:

(?:psssssp|sppppps|rsssssr|srrrrrs|prrrrrp|rpppppr)

Alternatively, you could use a regular expression that employs back-references:

([psr])(?!\1)([psr])\2{4}\1

The second regular expression has the following components.

([psr])  # match 'p', 's' or 'r' and save to capture group 1
(?!\1)   # the next character cannot be the content of capture group 1
([psr])  # match 'p', 's' or 'r' and save to capture group 2
\2{4}    # match the content of capture group 2 four times
\1       # match the content of capture group 1

(?!\1) is a negative lookahead.
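
For what it's worth, a quick way to check that the two expressions agree is to run both over the six strings listed above. The sketch below does that in Python (my choice of language; the constant and function names are just for illustration).

```python
import re

ALTERNATION = re.compile(r"(?:psssssp|sppppps|rsssssr|srrrrrs|prrrrrp|rpppppr)")
BACKREF = re.compile(r"([psr])(?!\1)([psr])\2{4}\1")

TARGETS = ["psssssp", "sppppps", "rsssssr", "srrrrrs", "prrrrrp", "rpppppr"]


def find_delimited_run(s):
    """Return the first run of five enclosed by a different letter, or None."""
    m = BACKREF.search(s)
    return m.group(0) if m else None


for t in TARGETS:
    # Both expressions should accept each target string in full.
    assert ALTERNATION.fullmatch(t) and BACKREF.fullmatch(t)

print(find_delimited_run("psrrrrsprsrpsrrsprrrrrprpssssrsprrpspspppprpsr"))  # prrrrrp
print(find_delimited_run("pssssssp"))  # None (six in a row)
```

The back-reference version has the advantage that it would not need rewriting if more letters were added to the character class.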