RegEx replace all slashes in capture group

I have a list of HTML data that is structured as shown below:

<div>lots of other data...
    <a href="http://localserver1/OpenFile?path=Test1%2FSubFolder%2Ffile1.pdf&OtherParam=1">Test1</a>
</div>
<div>lots of other data...
    <a href="http://localserver1/OpenFile?path=Test1/Subfolder/file2.pdf&OtherParam=2
</div>
<div>lots of other data...
    <a href="http://localserver1/OpenFile?path=Test2/Subfolder/file3.pdf&OtherParam=3
</div>

As you can see in the second URL, the slashes are not encoded. These links interface with a content management system (an admittedly bad one), and we very frequently get paths that are not encoded. I want to write a small block of C# code that checks whether the path in each of these HTML blocks contains slashes and replaces them with the %2F encoding. I have been able to locate every occurrence of the OpenFile link like this:

OpenFile\?path=(.*)&

However, I can't seem to find an easy way to look through the path's capture group and replace only the slashes in it. How would I go about doing this?

CodePudding user response：

Since your example uses "&" as the end of the pattern, I will assume it is consistent for all cases.

You can use this expression:

\/(?!.*OpenFile\?path=)(?=.*&)

https://regex101.com/r/hZ3Oja/1

This uses a negative lookahead on "OpenFile?path=" and a positive lookahead on "&", so a slash matches only if no "OpenFile?path=" follows it on the same line and an "&" does. That way it only replaces slashes that are part of your inner path.

Your C# call will look like Regex.Replace(input, pattern, replacement);
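
For illustration, here is a minimal sketch of that call wrapped in a helper. The class name SlashEncoder and the method name EncodePathSlashes are made up for this example, and "%2F" is assumed to be the replacement you want. Because "." does not match a newline by default, the lookaheads work line by line, so this assumes each link sits on its own line and that no unrelated "&" follows the path on that line.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SlashEncoder
{
    // Pattern from the answer above: a slash with no "OpenFile?path=" after it
    // on the same line, but with an "&" somewhere after it.
    private const string Pattern = @"\/(?!.*OpenFile\?path=)(?=.*&)";

    // Replaces every matching slash with its percent-encoded form.
    // "%2F" is an assumption about the desired replacement text.
    public static string EncodePathSlashes(string input)
    {
        return Regex.Replace(input, Pattern, "%2F");
    }
}
```

Calling SlashEncoder.EncodePathSlashes on "http://localserver1/OpenFile?path=Test1/Subfolder/file2.pdf&OtherParam=2" should leave the slashes in "http://localserver1/" alone and return "http://localserver1/OpenFile?path=Test1%2FSubfolder%2Ffile2.pdf&OtherParam=2". If two links share one line, the first link's path would be skipped, since "OpenFile?path=" still appears after it.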

CodePudding user response：

In C# you can use lookarounds to match the forward slash:

(?<=OpenFile\?path=[^\s&]*)/(?=[^\s&]*&)

Explanation

(?<=OpenFile\?path=[^\s&]*) Positive lookbehind, assert OpenFile?path= to the left, followed by optional non-whitespace chars other than &
/ Match the forward slash
(?=[^\s&]*&) Positive lookahead, assert optional non-whitespace chars other than & to the right, followed by an ampersand

If there can also be a match without an ampersand at the right, you can omit the last positive lookahead in the pattern.
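
As a rough sketch of how both variants could be used from C#: the class name PathSlashEncoder, the method name Encode, and the requireAmpersand parameter are invented for this example, and "%2F" is again assumed as the replacement. It relies on .NET allowing a variable-length lookbehind, which many other regex engines do not.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class PathSlashEncoder
{
    // Slash preceded by "OpenFile?path=" plus a run of non-whitespace, non-& chars,
    // and followed by such a run that ends in "&".
    private static readonly Regex WithAmpersand =
        new Regex(@"(?<=OpenFile\?path=[^\s&]*)/(?=[^\s&]*&)");

    // Same lookbehind with the trailing lookahead omitted.
    private static readonly Regex WithoutAmpersand =
        new Regex(@"(?<=OpenFile\?path=[^\s&]*)/");

    // Replaces the matched slashes with "%2F" (assumed replacement text).
    public static string Encode(string html, bool requireAmpersand = true)
    {
        Regex regex = requireAmpersand ? WithAmpersand : WithoutAmpersand;
        return regex.Replace(html, "%2F");
    }
}
```

With requireAmpersand set to false, every slash after "OpenFile?path=" up to the next whitespace or "&" is replaced. That can include a slash in a closing tag such as </a> if the path is the last parameter and no whitespace separates it from the tag, so the default variant is probably the safer one for the HTML shown in the question.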