I have a list of HTML data structured as shown below:
<div>lots of other data...
<a href="http://localserver1/OpenFile?path=Test1%2FSubFolder%2Ffile1.pdf&OtherParam=1">Test1</a>
</div>
<div>lots of other data...
<a href="http://localserver1/OpenFile?path=Test1/Subfolder/file2.pdf&OtherParam=2
</div>
<div>lots of other data...
<a href="http://localserver1/OpenFile?path=Test2/Subfolder/file3.pdf&OtherParam=3
</div>
As you can see in the second URL, the slashes are not encoded. These links interface with a content management system (an admittedly bad one), and we very frequently get paths that are not encoded. I want to write a small block of C# code that checks whether the path in each of these blocks of HTML contains slashes and replaces them with the %2F encoding. I have been able to locate every occurrence of the OpenFile link like this:
OpenFile\?path=(.*)&
However, I can't seem to find an easy way to look through the path's capture group and replace only the slashes inside it. How would I go about doing this?
CodePudding user response:
Since your example uses "&" as the end of the pattern, I will assume it is consistent for all cases.
You can use this expression:
\/(?!.*OpenFile\?path=)(?=.*&)
https://regex101.com/r/hZ3Oja/1
This uses a negative lookahead on "OpenFile?path=" and a positive lookahead on "&", so a slash matches only if no "OpenFile?path=" comes after it and an "&" does. That way it only replaces slashes that are part of your inner path.
In C#, the call will look like Regex.Replace(input, pattern, replacement);
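To make that call concrete, here is a minimal sketch using the pattern above. The sample line is modeled on the second link in the question; its closing `">file2</a>` is an assumption added for illustration, and "%2F" is used as the replacement. It is written with top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from this answer: a slash that has no "OpenFile?path=" after it
// and does have an "&" somewhere after it.
string pattern = @"\/(?!.*OpenFile\?path=)(?=.*&)";

// Hypothetical sample line; the link text and closing tag are assumed.
string input = "<a href=\"http://localserver1/OpenFile?path=Test1/Subfolder/file2.pdf&OtherParam=2\">file2</a>";

// "%2F" has no special meaning in a .NET replacement string, so it is inserted literally.
string result = Regex.Replace(input, pattern, "%2F");

Console.WriteLine(result);
```

With this input, only the two slashes inside the path are rewritten; the slashes in http://, the one before OpenFile, and the one in </a> are left alone. Because "." does not match a newline by default, both lookaheads stop at the end of the current line. A line that contains other slashes followed by an "&" (for example in the surrounding "other data") would be rewritten too, so treat this as a starting point rather than a complete solution.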
CodePudding user response:
In C# you can use lookarounds to match the forward slash:
(?<=OpenFile\?path=[^\s&]*)/(?=[^\s&]*&)
Explanation
(?<=OpenFile\?path=[^\s&]*)
Positive lookbehind, assert the OpenFile?path= part to the left, followed by optional non-whitespace chars excluding &
/
Match the forward slash
(?=[^\s&]*&)
Positive lookahead, assert an ampersand to the right
If there can also be a match without an ampersand at the right, you can omit the last positive lookahead in the pattern.
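As a rough illustration of both variants, the sketch below applies the pattern with and without the trailing lookahead. The two input strings are bare URLs modeled on the question; the second one, which has no "&", is hypothetical. The variable names are made up for this example, and the code uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from this answer. It depends on variable-length lookbehind,
// which the .NET regex engine supports.
var withAmpersand = new Regex(@"(?<=OpenFile\?path=[^\s&]*)/(?=[^\s&]*&)");

// Variant with the last positive lookahead omitted, for paths not followed by "&".
var withoutAmpersand = new Regex(@"(?<=OpenFile\?path=[^\s&]*)/");

string urlWithParam = "http://localserver1/OpenFile?path=Test1/Subfolder/file2.pdf&OtherParam=2";
string urlWithoutParam = "http://localserver1/OpenFile?path=Test2/Subfolder/file3.pdf"; // hypothetical input

string encodedA = withAmpersand.Replace(urlWithParam, "%2F");
string unchangedB = withAmpersand.Replace(urlWithoutParam, "%2F");   // no "&", so nothing matches
string encodedB = withoutAmpersand.Replace(urlWithoutParam, "%2F");

Console.WriteLine(encodedA);
Console.WriteLine(unchangedB);
Console.WriteLine(encodedB);
```

One thing to watch with the variant that drops the lookahead: if the input is full HTML rather than a bare URL, [^\s&]* keeps going past the closing quote of the href until it reaches whitespace or an "&", so a slash in a following </a> could also match. Adding the quote character to the excluded set is one possible adaptation, though that goes beyond what this answer states.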