Remove everything except email addresses from text using REGEX only

Time:11-04

EDIT: I have no access to a "replace" function, to any code, or to the regex matches. All I can do is provide a regex string to the API, which strips out whatever was matched (everything that is not part of an email) and leaves the rest (only the emails).


I am working with an API that reads data from an OCR document. I have no control over the API; however, I do have access to a function in the API that strips out whatever is matched by a provided regex. I am trying to strip out whatever is NOT an email address, leaving only the email addresses behind, separated by spaces if there is more than one. I know regex isn't the best tool for matching emails, but I have no other choice here.

Because the text comes from OCR, there are often characters around an email that should not be there. For example, the text could be (simple example) User Email:[email protected]*required field, and I would like to end up with just [email protected] by stripping out the rest.

  1. I can't define or use regex replace or any other functions. All I can do is define a regex for what to strip off (basically I need to invert an email match).
  2. I certainly don't expect this to work for all RFC-compliant email addresses, just reasonably most use-cases.
  3. In case it matters, I happen to know the architecture of the API is in C#
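
To make the constraint concrete, here is a minimal sketch of how the API's stripping can be modelled: every match is deleted and the remainder is kept. It is written in TypeScript purely as a stand-in for the C# API; `stripMatches`, the sample text, and the address in it are made up for illustration. It also shows why a plain negated character class is not enough: it removes stray symbols, but the surrounding words are glued onto the address.

```typescript
// Hypothetical stand-in for the API's behaviour: delete every match, keep the rest.
function stripMatches(text: string, pattern: RegExp): string {
  return text.replace(pattern, "");
}

// Made-up OCR text; the address is illustrative only.
const ocrText = "User Email:jane@example.com*required field";

// Naive attempt: strip every character that cannot appear in the simple email shape.
const naivePattern = new RegExp("[^A-Z0-9._%+@-]", "gi");

const naiveResult = stripMatches(ocrText, naivePattern);
console.log(naiveResult); // "UserEmailjane@example.comrequiredfield"
```

The pattern therefore has to match whole runs of text that are not part of an address, which is what calls for lookarounds.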

Here is what I tried in order to invert the email match. It does not work: it doesn't match anything.

^(?![A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}(?!.)

I also searched SO and found this link but it was inconclusive.

CodePudding user response:

This works in C#; it relies on variable-length lookbehind.

(?i)(?:(?<=([A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2,3}(?![A-Z])|[A-Z]{4})(?!\.[A-Z]{2})))|^)((?:(?![A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})[\S\s])+)

RegexStormSample

I left a couple of capture groups in so the parts can be inspected.
I also made a few tweaks in the lookbehind, because it looks like C# treats quantified ranges in a lookbehind with a non-greedy bias,
so they have to be controlled with extra sub-assertions to make it grab the whole subdomain.

 (?i)
 (?:
    (?<=
       (                             # (1 start)
          [A-Z0-9._%+-]+ @ [A-Z0-9.-]+ \.
          (?:
             [A-Z]{2,3} 
             (?! [A-Z] )
           | [A-Z]{4} 
          )
          (?! \. [A-Z]{2} )
       )                             # (1 end)
    )
  | ^
 )
 (                             # (2 start)
    (?:
       (?! [A-Z0-9._%+-]+ @ [A-Z0-9.-]+ \. [A-Z]{2,4} )
       [\S\s]
    )+
 )                             # (2 end)
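
As a rough check of the idea, here is a sketch that feeds the same pattern to a "strip whatever matched" helper. It uses TypeScript rather than C#, so treat it as an approximation: the leading `(?i)` is replaced by the `i` flag, and a runtime with lookbehind support (ES2018 or later) is assumed. `stripNonEmail` and the sample inputs are made up for illustration.

```typescript
// Simple email shape used in this thread (not RFC-complete), up to the dot before the TLD.
const emailHead = "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.";

// Start right after a complete address (lookbehind) or at the start of the text,
// then consume characters for as long as no address begins at the current position.
const nonEmailRun = new RegExp(
  "(?:(?<=(" + emailHead + "(?:[A-Z]{2,3}(?![A-Z])|[A-Z]{4})(?!\\.[A-Z]{2})))|^)" +
    "((?:(?!" + emailHead + "[A-Z]{2,4})[\\S\\s])+)",
  "gi"
);

// Hypothetical stand-in for the API: delete every match, keep the rest.
function stripNonEmail(text: string): string {
  return text.replace(nonEmailRun, "");
}

const single = stripNonEmail("User Email:jane@example.com*required field");
const pair = stripNonEmail("a.b@foo.org, or c_d@bar.net today");

console.log(single); // "jane@example.com"
console.log(pair);   // "a.b@foo.orgc_d@bar.net"
```

In this sketch the text between two addresses is stripped along with everything else, so multiple addresses come out run together rather than separated by spaces.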

CodePudding user response:

You can also use a negative lookbehind pattern like

(?s)(?<![\w.%+-]+@[\w.-]+\.[A-Za-z]{0,3}(?=[A-Za-z])|[\w.%+-]+@[\w.-]*(?=[\w.-]*\.[A-Za-z]{2,3})|(?=[\w.%+-]*@[\w.-]*\.[A-Za-z]{2,3})).

See the .NET regex demo.

Details:

  • (?s) - now, . matches line feed chars, too
  • (?<! - start of a negative lookbehind; if any of the following patterns matches, the match fails:
    • [\w.%+-]+@[\w.-]+\.[A-Za-z]{0,3}(?=[A-Za-z])| - one or more word, ., %, + or - chars, @, one or more word, . or - chars, ., zero to three letters that are followed with a letter, or
    • [\w.%+-]+@[\w.-]*(?=[\w.-]*\.[A-Za-z]{2,3})| - one or more word, ., %, + or - chars, @, zero or more word, . or - chars followed with zero or more word, . or - chars, ., two or three letters, or
    • (?=[\w.%+-]*@[\w.-]*\.[A-Za-z]{2,3}) - a position immediately followed with zero or more word, ., %, + or - chars, @, zero or more word, . or - chars, ., two or three letters
  • ) - end of the negative lookbehind
  • . - any 1 char.
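
The breakdown above can be tried out with a small sketch. It is TypeScript standing in for the .NET API, so it is only an approximation: the inline `(?s)` becomes the `s` flag, `\w` is ASCII-only here while .NET's `\w` is Unicode-aware by default, and lookbehind support (ES2018 or later) is assumed. `stripPerChar` and the sample inputs are made up for illustration.

```typescript
// Matches one character at a time, unless the lookbehind shows that the
// character belongs to something shaped like an email address.
const perCharPattern = new RegExp(
  "(?<!" +
    "[\\w.%+-]+@[\\w.-]+\\.[A-Za-z]{0,3}(?=[A-Za-z])" +   // inside the TLD
    "|[\\w.%+-]+@[\\w.-]*(?=[\\w.-]*\\.[A-Za-z]{2,3})" +  // inside the domain
    "|(?=[\\w.%+-]*@[\\w.-]*\\.[A-Za-z]{2,3})" +          // inside the local part, or at @
  ").",
  "gs"
);

// Hypothetical stand-in for the API: delete every match, keep the rest.
function stripPerChar(text: string): string {
  return text.replace(perCharPattern, "");
}

const oneLine = stripPerChar("User Email:jane@example.com*required field");
const multiLine = stripPerChar("Email:\njane@example.com\n");

console.log(oneLine);   // "jane@example.com"
console.log(multiLine); // "jane@example.com"
```

The second input is there to show the effect of the `s` flag: the line breaks are matched by `.` and stripped like any other non-email character.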