Capturing hyperlinks conditionally in regex captures too much

Time:08-12

I'm attempting to replace specific hyperlinks in text with the following expression:

<a .*?href=(['"]?).*?\/?download.aspx\?myid=\{?([0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12})\}?\1.*?>(.*?)<\/a>

Example text:

<a clicktracking=off href="https://www.test.com/1.pdf">test</a><br />Test<br />Test</div><br><div>&nbsp;</div><br><div><a clicktracking=off href="download.aspx?myID=11111111-2222-3333-4444-555555555555">Teet.docx</a></div>

My issue is that it captures everything between the first <a and the last </a>, when I want it to capture only <a clicktracking=off href="download.aspx?myID=11111111-2222-3333-4444-555555555555">Teet.docx</a> (along with any other links that match that format).

I've tried putting [^>] in a few different spots to get it to 'break out' of the match early, but had no luck. I've also tried inserting a newline after each </a>, which makes it work, but it's not desirable to alter text beyond the link that I want to replace.

Regex101 Example: https://regex101.com/r/BjpO1d/1

Unsure if it matters but I'm using "JScript".
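
The over-matching described above can be reproduced with a short script. This is only a minimal sketch: the variable names are made up for illustration, and the i flag is an assumption (the pattern spells myid while the sample text has myID, so the match needs to be case-insensitive).

```javascript
// Sample text from the question (it has no line breaks, so "." can cross every tag).
var text = '<a clicktracking=off href="https://www.test.com/1.pdf">test</a><br />Test<br />Test</div><br><div>&nbsp;</div><br><div><a clicktracking=off href="download.aspx?myID=11111111-2222-3333-4444-555555555555">Teet.docx</a></div>';

// Original pattern. The "i" flag is assumed, because the pattern says "myid"
// while the text says "myID".
var original = /<a .*?href=(['"]?).*?\/?download.aspx\?myid=\{?([0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12})\}?\1.*?>(.*?)<\/a>/i;

var match = original.exec(text);

// The match begins at the very first "<a" instead of at the download link:
// the lazy ".*?" after href= keeps expanding until "download.aspx" is found.
var startsAtFirstLink = match.index === 0;

// Group 3 is still the text of the download link, which is where the match ends.
var linkText = match[3];
```

Lazy quantifiers only prefer the shortest expansion from a given starting position; they do not move the starting position forward, which is why the match still begins at the first link.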

CodePudding user response:

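You need to stop .*? from matching across another <a> or </a> tag. Replace each .*? that comes before the download.aspx part with a pattern that refuses to step over the start of such a tag: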

<a\s(?:(?!<\/?a[\s>]).)*?href=(['"]?)(?:(?!<\/?a[\s>]).)*?download\.aspx\?myid=\{?([0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12})\}?\1.*?>(.*?)<\/a>

See the regex demo.
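
As a rough sketch of how the pattern above might be used for the replacement, the script below runs it over the sample text from the question. The g and i flags are assumptions (i because the text has myID while the pattern has myid), and the /files/ target URL in the callback is purely hypothetical.

```javascript
// Sample text from the question.
var text = '<a clicktracking=off href="https://www.test.com/1.pdf">test</a><br />Test<br />Test</div><br><div>&nbsp;</div><br><div><a clicktracking=off href="download.aspx?myID=11111111-2222-3333-4444-555555555555">Teet.docx</a></div>';

// Pattern from this answer, with assumed "g" (all links) and "i" (myID vs myid) flags.
var fixed = /<a\s(?:(?!<\/?a[\s>]).)*?href=(['"]?)(?:(?!<\/?a[\s>]).)*?download\.aspx\?myid=\{?([0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12})\}?\1.*?>(.*?)<\/a>/gi;

// Hypothetical replacement: rebuild each matched link around a made-up
// "/files/<id>" URL. The callback receives the whole match, then groups 1 to 3.
var result = text.replace(fixed, function (whole, quote, id, label) {
  return '<a href="/files/' + id + '">' + label + '</a>';
});
```

With this input, only the download.aspx link is rewritten; the first link and the markup between the two links are left as they were.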

The key part is (?:(?!<\/?a[\s>]).)*?, which means:

  • (?: - start of a non-capturing group:
    • (?!<\/?a[\s>]). - any one char other than line break chars, provided that char is not the starting point of a <\/?a[\s>] match: <, then an optional /, then a, then either a whitespace char or >
  • )*? - end of the group, zero or more repetition, as few as possible.
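
To see what the group described above accepts on its own, here is a small sketch that anchors it with ^ and $, so a string passes only if every character can be consumed by the group. The sample strings and variable names are made up for illustration.

```javascript
// The group from the explanation, anchored so it has to cover the whole string.
var token = /^(?:(?!<\/?a[\s>]).)*?$/;

var results = [
  token.test('clicktracking=off href="x"'), // true: no tag start at all
  token.test('Test<br />Test'),             // true: "<br" does not start with "<a"
  token.test('one<abbr>two'),               // true: "<a" is followed by "b", not whitespace or ">"
  token.test('test</a><br />'),             // false: "</a>" blocks the group
  token.test('text <a href="x">')           // false: "<a " blocks the group
];
```

The [\s>] at the end of the lookahead is what keeps tags such as <abbr> from being treated as anchor tags.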