I'm attempting to find specific hyperlinks in text, so that I can replace them, with the following expression:
<a .*?href=(['"]?).*?\/?download.aspx\?myid=\{?([0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12})\}?\1.*?>(.*?)<\/a>
Example text:
<a clicktracking=off href="https://www.test.com/1.pdf">test</a><br />Test<br />Test</div><br><div> </div><br><div><a clicktracking=off href="download.aspx?myID=11111111-2222-3333-4444-555555555555">Teet.docx</a></div>
My issue is that it's capturing everything between the first <a and the last </a>, when I want it to capture only <a clicktracking=off href="download.aspx?myID=11111111-2222-3333-4444-555555555555">Teet.docx</a> (along with any other links that match that format).
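To illustrate, here is a minimal reproduction of the over-match. It assumes the pattern runs with the case-insensitive flag (the sample has myID while the pattern has myid); the variable names are only for illustration.

```javascript
// Original pattern from above; the "i" flag is an assumption.
var originalPattern = /<a .*?href=(['"]?).*?\/?download.aspx\?myid=\{?([0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12})\}?\1.*?>(.*?)<\/a>/i;

var sampleText = '<a clicktracking=off href="https://www.test.com/1.pdf">test</a><br />Test<br />Test</div><br><div> </div><br><div><a clicktracking=off href="download.aspx?myID=11111111-2222-3333-4444-555555555555">Teet.docx</a></div>';

var match = originalPattern.exec(sampleText);

// The match begins at the first <a (index 0), not at the download link,
// because .*? after href= is free to run past the first </a>.
var matchedText = match[0];
var startsAtFirstLink = match.index === 0;
```

The captured link text (group 3) is still Teet.docx, but the overall match also swallows the unrelated 1.pdf link and the markup in between.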
I've tried putting [^>] in a few different spots to get it to 'break out' of the match early, but had no luck. I've also tried inserting a newline after each </a>, which lets it work, but it's not desirable to alter text beyond the link that I want to replace.
Regex101 Example: https://regex101.com/r/BjpO1d/1
Unsure if it matters but I'm using "JScript".
CodePudding user response:
You need to "exclude" the a tag from what .*? is allowed to match:
<a\s(?:(?!<\/?a[\s>]).)*?href=(['"]?)(?:(?!<\/?a[\s>]).)*?download\.aspx\?myid=\{?([0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12})\}?\1.*?>(.*?)<\/a>
See the regex demo.
See (?:(?!<\/?a[\s>]).)*? which means:
(?: - start of a non-capturing group
(?!<\/?a[\s>]). - any one char other than line break chars that is not the starting point for the <\/?a[\s>] pattern: <, an optional /, a, then a whitespace or >
)*? - end of the group, zero or more repetitions, as few as possible.
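As a rough sketch of how this could be used for the replacement itself, the snippet below applies the pattern above to the sample text from the question. The "gi" flags, the function name replaceDownloadLinks, and the replacement string are assumptions made for illustration; the question does not say what the links should be replaced with.

```javascript
// Tempered pattern from this answer. The "gi" flags are an assumption:
// "i" because the sample has myID while the pattern has myid,
// "g" so that every matching link is replaced.
var linkPattern = /<a\s(?:(?!<\/?a[\s>]).)*?href=(['"]?)(?:(?!<\/?a[\s>]).)*?download\.aspx\?myid=\{?([0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12})\}?\1.*?>(.*?)<\/a>/gi;

var sampleHtml = '<a clicktracking=off href="https://www.test.com/1.pdf">test</a><br />Test<br />Test</div><br><div> </div><br><div><a clicktracking=off href="download.aspx?myID=11111111-2222-3333-4444-555555555555">Teet.docx</a></div>';

// Hypothetical replacement: swap each matching link for a plain-text marker.
// $2 is the captured GUID, $3 is the captured link text.
function replaceDownloadLinks(html) {
  return html.replace(linkPattern, '[file $2: $3]');
}

var rewritten = replaceDownloadLinks(sampleHtml);
```

With the sample text, only the download.aspx link is rewritten; the 1.pdf link and the markup between the two links are left as they were.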