RegEx group characters into a word

I want to tell RegEx to match (or not match) only when a set of characters appears all together, in the order I specify (like a word), and not as separate characters. (I'm using JavaScript for this particular example.)

I am writing a RegEx for Discord IDs following the rules set in https://discord.com/developers/docs/resources/user and here's what I've got so far:

/^(.)[^#@]+[#][0-9]{4}$/

For those who don't want to open the page, the rules are:

1- The first part can contain any number of any characters except #, @, and ''' (the third is not handled yet).

2- The second part can only be a # character.

3- The third part should be a 4-digit number.

All of this works, except that I want my regex to allow ', '' or even '''''' but not ''', so that only the entire "word" (that exact set of characters) is rejected. How can I make it work?

Edited:

I'm adding this since the question seems to be vague and to cause confusion. The answer to the main question is to add a negative lookahead ((?!''')) for the word you want to exclude to the relevant part of the regex. However, for '''''' to be allowed as I asked, since '''''' itself contains ''', it is no longer just a matter of finding the word but also of checking what comes before and after it, in which case the accepted answer is correct.

I explained my real situation, but another example would be allowing @ and # while rejecting the sequence @#.
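
As a rough illustration of that @ and # case (a sketch only; the name noAtHash is made up for this example), a negative lookahead placed before each character rejects the sequence @# while still allowing @ and # on their own:

```javascript
// Hypothetical pattern: at every position, assert that "@#" does not start here,
// then consume one character.
const noAtHash = /^(?:(?!@#).)+$/;

console.log(noAtHash.test("a@b#c")); // true
console.log(noAtHash.test("a#@b"));  // true
console.log(noAtHash.test("a@#b"));  // false
```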

(Also, for those wondering: I changed the ``` character sequence defined by the Discord devs to ''' because the former would have interfered with Stack Overflow's code formatting. The length is being controlled via JS, not regex, and I'm ignoring spaces for the sake of simplicity in this case.)

CodePudding user response：

This should suit your needs:

^('(?!'')|[^#@'])+#\d{4}$

The first part was your issue. '(?!'')|[^#@'] means:

either ' if not followed by ''
or any char except #, @ and ' (as already handled above)
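
A quick way to try this in JavaScript, as a sketch (the variable name discordIdPattern and the sample IDs are only illustrative):

```javascript
// The pattern from above: a ' is only allowed when it is not followed by '',
// so any run of three or more ' is rejected.
const discordIdPattern = /^('(?!'')|[^#@'])+#\d{4}$/;

console.log(discordIdPattern.test("foobar#1234")); // true
console.log(discordIdPattern.test("f''bar#1234")); // true
console.log(discordIdPattern.test("f'''ar#1234")); // false
console.log(discordIdPattern.test("f@obar#1234")); // false
```

Note that this version also rejects longer runs such as '''', because the first ' of the run is followed by ''.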

For the sake of completeness, the following will forbid any run of consecutive ' whose length is a multiple of 3, so ''', '''''', etc.:

'(?!'')|'''(?=')|[^#@']

'''(?='): ''' as long as followed by another '

The following will forbid exactly 3 consecutive ', but will allow any other occurrence (including '''''' for example):

'(?!'')|''''+|[^#@']

''''+: four or more ' (could be rewritten '{4,})
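
To compare the two variants side by side, here is a small sketch. The helper build and the names noMultiplesOfThree and noExactlyThree are made up for this example; build just wraps each alternation in ^(?:...)+#\d{4}$ as in the first pattern.

```javascript
// Hypothetical helper: turns the alternation for the name part into a full ID pattern.
const build = (namePart) => new RegExp(`^(?:${namePart})+#\\d{4}$`);

const noMultiplesOfThree = build(`'(?!'')|'''(?=')|[^#@']`);
const noExactlyThree = build(`'(?!'')|''''+|[^#@']`);

// Try runs of 1 to 7 consecutive ' inside an otherwise valid ID.
for (let n = 1; n <= 7; n++) {
  const id = "a" + "'".repeat(n) + "b#1234";
  console.log(n, noMultiplesOfThree.test(id), noExactlyThree.test(id));
}
// Runs of 3 fail both patterns; a run of 6 fails only noMultiplesOfThree.
```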

CodePudding user response：

To disallow exactly 3 consecutive ' characters, and if lookbehind support is available, you can use a negative lookahead.

The single capture group at the start (.) can be part of the negated character class [^#@\n] if you don't need to reuse its value for later processing.

^(?!.*(?<!')'''(?!'))[^#@\n]+#[0-9]{4}$

^ Start of string
(?!.*(?<!')'''(?!')) Negative lookahead, assert that the string does not contain 3 ' chars that are not surrounded by another '
[^#@\n]+ Match 1+ times any char except the listed
#[0-9]{4} match # and 4 digits
$ End of string

Note that the # does not have to be in a character class ([#]), and if you don't want to match across newlines, you can add \n to the negated character class.
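
A minimal way to check this in JavaScript, assuming an engine with lookbehind support (ES2018 or later); the variable name exactlyThreeForbidden is only illustrative:

```javascript
// Rejects IDs that contain a run of exactly three ' anywhere before the #.
const exactlyThreeForbidden = /^(?!.*(?<!')'''(?!'))[^#@\n]+#[0-9]{4}$/;

console.log(exactlyThreeForbidden.test("f''bar#1234"));  // true
console.log(exactlyThreeForbidden.test("f'''ar#1234"));  // false
console.log(exactlyThreeForbidden.test("f''''r#1234"));  // true
console.log(exactlyThreeForbidden.test("f''''''#1234")); // true
```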

CodePudding user response：

Keep in mind that, while regexes can be very entertaining, in practice an extremely complex regex is usually a sign that someone got fixated on regex and didn't consider an easier approach.

Consider this advice from Jeff Atwood:

Regular expressions are like a particularly spicy hot sauce – to be used in moderation and with restraint only when appropriate. Should you try to solve every problem you encounter with a regular expression? Well, no. Then you'd be writing Perl, and I'm not sure you need those kind of headaches. If you drench your plate in hot sauce, you're going to be very, very sorry later.

...

Let me be very clear on this point: If you read an incredibly complex, impossible to decipher regular expression in your codebase, they did it wrong. If you write regular expressions that are difficult to read into your codebase, you are doing it wrong.

I don't know your situation, but it sounds like it would be much easier to look for a bad ID than to try to define a good ID. If you can break this into two steps, then the logic will be easier to read and maintain.

Verify that the final part of the ID is as expected (/#\d{4}/)
Verify that the first part of the ID does not have any invalid characters or sequences

function isValid(id) {
  // Capture everything before the final "#" + 4 digits; anchored so nothing may follow the digits.
  const idPrefix = /^(.+)#\d{4}$/.exec(id)?.[1];
  if (idPrefix === undefined) return false; // The #\d{4} postfix was missing
  // If we find an illegal character or sequence, then the id is not valid:
  return !(/[#@]|(^|[^'])(''')($|[^'])/.test(idPrefix));
}

That second regex is a bit long, but here's how it breaks down:

If the ID contains a # or @, then it's not legal.
Check for a sequence of ''' that IS NOT surrounded by a fourth '. Also take the beginning and end of the string into account. If we find a sequence of exactly three ', then it's not legal.

The result:

isValid("foobar#1234") // true
isValid("f#obar#1234") // false
isValid("f@obar#1234") // false
isValid("f''bar#1234") // true
isValid("f'''ar#1234") // false
isValid("f''''r#1234") // true