Regex python ignore word followed by given character

Time:09-23

I have the regex (?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9-_]+)(?!\w).

Given the string @first@nope @second@Hello @my-friend, email@ [email protected] @friend, how can I exclude @first and @second, since they are not whole words on their own? In other words, they should be excluded because they are immediately followed by @.

CodePudding user response:

You can use

(?<![a-zA-Z0-9_.-])@(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![@\w])
(?a)(?<![\w.-])@(?=([A-Za-z][\w-]*))\1(?![@\w])

Details:

  • (?<![a-zA-Z0-9_.-]) - a negative lookbehind that matches a location that is not immediately preceded by an ASCII letter, digit, _, . or -
  • @ - a @ char
  • (?=([A-Za-z]+[A-Za-z0-9_-]*)) - a positive lookahead with a capturing group inside that captures one or more ASCII letters and then zero or more ASCII letters, digits, - or _ chars
  • \1 - the Group 1 value (the lookahead is atomic, so once the name is captured the regex engine cannot backtrack into it and shorten it)
  • (?![@\w]) - a negative lookahead that fails the match if there is a word char (letter, digit or _) or a @ char immediately to the right of the current location.
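As a rough check of the details above, the following sketch runs the first pattern with Python's re module. The test string follows the question, except that the obfuscated address is replaced by the placeholder someone@example.com, which is an assumption. The second pattern, named plain here, is a hypothetical variant without the lookahead and backreference, included only to show what the atomic capture changes.

```python
import re

# Sample string from the question; "someone@example.com" is a placeholder
# (assumption) standing in for the address that is obfuscated in the post.
text = "@first@nope @second@Hello @my-friend, email@ someone@example.com @friend"

# Pattern from the answer: the name is captured inside a lookahead and then
# consumed with \1, so the engine cannot shorten it after the fact.
atomic = re.compile(r"(?<![a-zA-Z0-9_.-])@(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![@\w])")

# Hypothetical comparison variant: an ordinary capturing group that the
# engine is free to backtrack into.
plain = re.compile(r"(?<![a-zA-Z0-9_.-])@([A-Za-z]+[A-Za-z0-9_-]*)(?![@\w])")

print(atomic.findall(text))      # ['my-friend', 'friend']
print(atomic.findall("@my-@x"))  # []
print(plain.findall("@my-@x"))   # ['my']
```

In the last line the plain group gives up the trailing hyphen, so the final lookahead sees - instead of @ and a partial name slips through. The lookahead plus \1 form rejects the whole token instead.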

Note that I put the hyphens at the end of the character classes. This is best practice, because a hyphen in that position is always a literal char and cannot be mistaken for a range operator.

The (?a)(?<![\w.-])@(?=([A-Za-z][\w-]*))\1(?![@\w]) alternative uses shorthand character classes and the (?a) inline modifier (the equivalent of re.ASCII / re.A), which makes \w match only ASCII chars, as in the original version. Remove (?a) if you plan to match any Unicode digits/letters.
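To illustrate what the (?a) modifier changes, here is a small sketch comparing the pattern with and without it. The input @café is an assumed example, not taken from the question.

```python
import re

# With (?a), \w only matches ASCII letters, digits and underscore.
ascii_only = re.compile(r"(?a)(?<![\w.-])@(?=([A-Za-z][\w-]*))\1(?![@\w])")

# Without (?a), \w also matches Unicode letters and digits (the default
# for str patterns in Python 3).
unicode_aware = re.compile(r"(?<![\w.-])@(?=([A-Za-z][\w-]*))\1(?![@\w])")

text = "@caf\u00e9"  # "@café", an assumed example input

print(ascii_only.findall(text))     # ['caf']
print(unicode_aware.findall(text))  # ['café']
```

In ASCII mode the é is not a word char, so the name stops at caf and the trailing lookahead still succeeds. Whether that truncation is acceptable depends on the input you expect.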

CodePudding user response:

Another option is to assert a whitespace boundary to the left, and assert no word char or @ sign to the right.

(?<!\S)@([A-Za-z]+[\w-]+)(?![@\w])

The pattern matches:

  • (?<!\S) Negative lookbehind, assert that there is no non-whitespace char directly to the left
  • @ Match literally
  • ([A-Za-z]+[\w-]+) Capture group 1, match one or more chars A-Za-z and then one or more word chars or -
  • (?![@\w]) Negative lookahead, assert that there is no @ or word char directly to the right
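A minimal sketch of the whitespace-boundary pattern in Python follows. The first string follows the question, with the placeholder someone@example.com (an assumption) in place of the obfuscated address. The second string is an assumed example that shows two consequences of this pattern: a mention directly after punctuation is skipped, and so is a one-letter name, because the two quantified parts together require at least two chars.

```python
import re

# Whitespace boundary on the left, no @ or word char on the right.
ws_boundary = re.compile(r"(?<!\S)@([A-Za-z]+[\w-]+)(?![@\w])")

# Sample string from the question; the address is a placeholder (assumption).
text = "@first@nope @second@Hello @my-friend, email@ someone@example.com @friend"

print(ws_boundary.findall(text))           # ['my-friend', 'friend']

# Assumed extra input: "(" before @ is a non-whitespace char, and "b" is
# too short for [A-Za-z]+[\w-]+, so neither mention is returned.
print(ws_boundary.findall("(@alice) @b"))  # []
```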

Alternatively, match a non-word boundary \B before the @ instead of using a lookbehind.

\B@([A-Za-z]+[\w-]+)(?![@\w])
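The sketch below compares the \B variant with the whitespace-boundary variant. Because @ is itself a non-word char, \B in front of it only requires that the preceding char is not a word char, so punctuation before the @ is allowed. The second input is an assumed example, and someone@example.com is again a placeholder for the obfuscated address.

```python
import re

non_word_boundary = re.compile(r"\B@([A-Za-z]+[\w-]+)(?![@\w])")
ws_boundary = re.compile(r"(?<!\S)@([A-Za-z]+[\w-]+)(?![@\w])")

# Sample string from the question; the address is a placeholder (assumption).
text = "@first@nope @second@Hello @my-friend, email@ someone@example.com @friend"
print(non_word_boundary.findall(text))   # ['my-friend', 'friend']

# Assumed extra input: mentions preceded by "(" and "." pass \B but fail
# the whitespace lookbehind.
extra = "(@alice) x.@bob"
print(non_word_boundary.findall(extra))  # ['alice', 'bob']
print(ws_boundary.findall(extra))        # []
```

Which of the two fits better depends on whether mentions that directly follow punctuation should count.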
