Password Regex - Identifying how a given regex works

Time:07-01

I have been given a regex and I have to identify what it means. I have partially understood it, and using regex101 I found out that it uses positive lookaheads, but I do not understand how they are used in this specific case (since it also uses ?, which I thought marks an optional value [0 or 1]).

^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-_]).{8,}$

Could somebody explain to me in a paragraph how this regex is working?

CodePudding user response:

The regex engine maintains a pointer in the string. The pointer initially points to a location before the first character of the string. Later it may be moved to locations between two successive characters or to a location that follows the last character of the string.


The anchor ^ at the beginning of the expression requires the match to start at the beginning of the string. If, for example, the string were 'cat', the regular expression ^a would not match the string because the first character is not 'a'. By contrast, the regex a would match the character 'a' (after failing to match 'c' and then moving the string pointer ahead one character).
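As a quick check of the 'cat' example, here is a minimal sketch using Python's re module (the choice of Python is an assumption; the question does not say which engine is in use):

```python
import re

# '^a' may only match at the very start of the string, and 'cat' starts with 'c'.
anchored = re.search(r'^a', 'cat')

# Without the anchor, the engine fails at 'c', advances one position,
# and then matches 'a' at index 1.
unanchored = re.search(r'a', 'cat')

print(anchored)            # None
print(unanchored.start())  # 1
```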


Four positive lookaheads are then executed. Each is of the form

(?=.*?<match something>)

(?= and ) denote the beginning and end of a positive lookahead.

"positive" in "positive lookahead" means the enclosed assertion must be satisfied for there to be a match. "lookahead" means the assertion concerns the characters that immediately follow the current location of the string pointer. [1][2]

The characters matched within the positive lookahead are not part of the match returned by the engine.

The expression .* directs the engine to match zero or more characters other than line terminators (such as '\n' or '\r\n'). This is followed by ?, a modifier that instructs the engine to perform the match lazily; that is, match as few characters as possible. Without the ? the engine would perform a greedy match: match as many characters as possible. By way of example, if the string were 'axbcdxe', .*x would match 'axbcdx' whereas .*?x would match 'ax'.
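The 'axbcdxe' example can be reproduced with a short sketch, again assuming Python's re module:

```python
import re

s = 'axbcdxe'

# Greedy: .* grabs as much as it can, then backs off to the last 'x'.
greedy = re.match(r'.*x', s).group()

# Lazy: .*? grabs as little as it can, stopping at the first 'x'.
lazy = re.match(r'.*?x', s).group()

print(greedy)  # axbcdx
print(lazy)    # ax
```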


The first positive lookahead is

(?=.*?[A-Z])

[A-Z] is a character class that matches a single character, provided it is one of the characters in the class. A-Z denotes a range comprising all 26 uppercase letters. [A-Z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. This positive lookahead requires the engine to match zero or more characters and then match an uppercase letter. (Because the match is lazy it will match zero or more characters other than uppercase letters and then match the first uppercase letter in the string.) That is, this positive lookahead requires the string to contain at least one uppercase letter.

Lookaheads are zero-width: executing one inspects the characters ahead but never moves the string pointer forward. Therefore, when successive positive lookaheads are executed after the anchor ^, the string pointer remains immediately before the first character of the string.
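To see that the pointer stays put, here is a small sketch (Python's re module assumed; the sample strings 'abC' and 'a1bC' are made up for illustration):

```python
import re

# The lookahead succeeds because 'abC' contains an uppercase letter,
# yet the match itself is empty and ends at position 0.
m = re.match(r'^(?=.*?[A-Z])', 'abC')
print(repr(m.group()))  # ''
print(m.end())          # 0

# Both lookaheads start from position 0, so the order in which the
# required characters appear in the string does not matter.
both = re.match(r'^(?=.*?[A-Z])(?=.*?[0-9])', 'a1bC')
print(both is not None)  # True
print(both.end())        # 0
```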


The successive positive lookaheads are as follows:

(?=.*?[a-z])
(?=.*?[0-9])
(?=.*?[#?!@$%^&*-_])

These respectively require that the string contain:

  • a lowercase letter;
  • a digit; and
  • one of the characters in the character class [3]

The remainder of the regular expression is

.{8,}$

.{8,} matches 8 or more characters other than line terminators. $ matches the end of the string. In other words, this requires that the string be a single line that contains at least 8 characters.
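Putting the pieces together, the following sketch runs the regex from the question against a few made-up passwords. Python's re module is assumed, and the names PASSWORD_RE and is_valid are hypothetical helpers, not part of the question:

```python
import re

# The regex exactly as given in the question.
PASSWORD_RE = re.compile(
    r'^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-_]).{8,}$'
)

def is_valid(candidate: str) -> bool:
    """Return True if the candidate satisfies every part of the regex."""
    return PASSWORD_RE.match(candidate) is not None

print(is_valid('Passw0rd!'))  # True:  all four lookaheads pass, 9 characters
print(is_valid('passw0rd!'))  # False: no uppercase letter
print(is_valid('PASSW0RD!'))  # False: no lowercase letter
print(is_valid('Password!'))  # False: no digit
print(is_valid('Pw0!'))       # False: fewer than 8 characters
```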

1. A "negative lookahead" ((?! ... )) requires that the enclosed assertion not be satisfied for there to be a match. "Positive lookbehinds" and "negative lookbehinds" are similar to their corresponding lookaheads except they apply to characters immediately preceding the string pointer. The four kinds of assertions are collectively referred to as "lookarounds". Different regex engines may support all, some or none of the four types of lookarounds, or impose limitations on certain lookarounds (such as supporting fixed-length but not variable-length lookbehinds).

2. The positive lookahead could have been written (?=.*[A-Z]). That would match characters at the beginning of the string greedily until the last capital letter was reached. A lazy match was adopted merely to speed calculations.

3. *-_ denotes the range of ASCII characters between '*' and '_', inclusive, namely, "*", "+", ",", "-", ".", "/", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "[", "\", "]", "^", "_". That may be an error. If the author merely meant to include a hyphen in the character class, it would have to be escaped or appear at the beginning or end of the class: [-#?!@$%^&*_], [#?!@$%^&*_-] or [#?!@$%^&*\-_].
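The practical effect of the *-_ range can be checked with a sketch like the one below (Python's re module assumed). CORRECTED is a hypothetical fix that moves the hyphen to the end of the class; whether that is what the original author intended is a guess:

```python
import re

# The regex as given, where *-_ is a range inside the character class.
ORIGINAL = re.compile(
    r'^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-_]).{8,}$'
)
# Hypothetical correction: the hyphen is last, so it is a literal character.
CORRECTED = re.compile(
    r'^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*_-]).{8,}$'
)

# Every ASCII character from '*' (42) through '_' (95), inclusive.
in_range = [chr(c) for c in range(ord('*'), ord('_') + 1)]
print(len(in_range))                                      # 54
print('+' in in_range, '5' in in_range, 'Q' in in_range)  # True True True

# 'Passw0rd' has no punctuation, but 'P' and '0' fall inside the range,
# so the original regex accepts it while the corrected one does not.
print(ORIGINAL.match('Passw0rd') is not None)   # True
print(CORRECTED.match('Passw0rd') is not None)  # False
```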

CodePudding user response:

The question mark can be used as a quantifier (match 0 or 1 times) and also appears in quite a few group definitions, such as (?= ... ). In this case though, apart from the groups, it is being used to make a quantifier lazy. By default, quantifiers are greedy, meaning they will match as much as they possibly can while still matching the pattern. A lazy quantifier, on the other hand, will match as little as it possibly can.

Example:

Say we have some text:

This text "has" some "quotes" in it.

Our goal is to match each quoted piece of text. The simple expression ".*" looks like it would do the trick, right? When using this expression we get the following match:

"has" some "quotes"

Hmm that's not quite right. It's matching everything from the very first quote, to the very last quote. This is because the quantifier is greedy. Now, let's make it lazy by adding a question mark to it (".*?"). Now we get these matches:

"has"
"quotes"

Now that looks much better.
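The two results can be reproduced with a short sketch, assuming Python's re module (the answer itself does not name an engine):

```python
import re

text = 'This text "has" some "quotes" in it.'

# Greedy: one match running from the first quote to the last quote.
greedy = re.findall(r'".*"', text)

# Lazy: each quoted piece is matched separately.
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"has" some "quotes"']
print(lazy)    # ['"has"', '"quotes"']
```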
