Regex to find substring between 2 Strings excluding a specific String

Time:11-10

I have checked the existing questions on Stack Overflow but couldn't find an answer that fits my case, so I need your help.

Basically, I have multiple strings that contain URLs in different formats, e.g.:

1:

<p><a href='https://abcd.com/sites/WG-ProductManagementTeam/FunctionalSpecs/Forms/AllItems.aspx?id=/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/Enhancements to PA Peer Checklist/PA Peer Checklist (V2.3) -v10.0.pdf&amp;parent=/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/Enhancements to PA Peer Checklist&amp;p=true&amp;ga=1'>WG-Product Management Team - PA Peer Checklist (V2.3) -v10.0.pdf - All Documents (sharepoint.com)</a></p>

2:

https://abcd.com/sites/WG-ProductManagementTeam/FunctionalSpecs/Forms/AllItems.aspx?id=/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/Enhancements to PA Peer Checklist/PA Peer Checklist (V2.3) -v10.0.pdf&parent=/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/Enhancements to PA Peer Checklist&p=true&ga=1

3:

https://abcd.com/:b:/r/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/Enhancements to PA Peer Checklist/PA Peer Checklist (v2.0) - v3.0.pdf?csf=1&web=1&e=txs2Yq

I want to extract a part of the URL like this: /DevDOC/....../.pdf

As you can see, the three URL strings above are all different, and I have not been able to find an efficient way to handle them.

I need a solution that works for every type of URL string: even though the formats differ, it should extract the same part from each of them.

Right now I am using the regex ".*/FunctionalSpecs(?!.*\1)(.*?)(.pdf)". It works for URLs 2 and 3 above, but for URL 1 it returns:

/DevDOC/Enhancements to PA Peer Checklist&p=true&ga=1'>WG-Product Management Team - PA Peer Checklist (V2.3) -v10.0.pdf

which is incorrect, I wanted this:

/DevDOC/Enhancements to PA Peer Checklist/PA Peer Checklist (V2.3) -v10.0.pdf

Please help me resolve this as soon as possible. It seems so easy, but I have not been able to do it in an efficient way.

Also, I am trying to do it in Java.

Any help is highly appreciated. Thank you.

CodePudding user response:

You can either decode the URL first and then use:

`/DevDOC/[^\.]+\.pdf`

Or, without decoding, you might want to use:

`DevDOC[^\.]+pdf`

I'm relying here on the first period after DevDOC being the one in .pdf, since the character class keeps matching until the first period appears. If that doesn't work (for example, when the file name itself contains periods), you might want to use `[^"]+` instead.
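Since the question asks for Java, here is a minimal sketch of the same idea. One swap to note: the sample file names contain periods (V2.3, v10.0), so instead of a negated-period class this uses a lazy `.*?`, which stops at the first `.pdf` after `/DevDOC/`. The class and method names (`DevDocPathExtractor`, `extract`) are made up for illustration, and the sketch assumes the first `/DevDOC/` in the string is the one you want and that the match is case-sensitive.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class DevDocPathExtractor {

    // Lazy quantifier: start at the first "/DevDOC/" and stop at the nearest ".pdf".
    private static final Pattern DEVDOC_PDF = Pattern.compile("/DevDOC/.*?\\.pdf");

    // Returns the first "/DevDOC/....pdf" substring, or empty if there is none.
    static Optional<String> extract(String text) {
        Matcher m = DEVDOC_PDF.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // URL 3 from the question.
        String url3 = "https://abcd.com/:b:/r/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/"
                + "Enhancements to PA Peer Checklist/PA Peer Checklist (v2.0) - v3.0.pdf?csf=1&web=1&e=txs2Yq";
        System.out.println(extract(url3).orElse("no match"));
    }
}
```

Because `find()` scans from the left and the quantifier is lazy, the anchor text after `'>` in URL 1 is never reached: the match ends at the `.pdf` inside the `id=` parameter.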

CodePudding user response:

You can use decodeURIComponent (JavaScript) to decode your URL and then extract the value as shown below.

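var url = decodeURIComponent("your encoded url string");
// Lazy match from /DevDOC to the first ".pdf"; match() returns null when nothing is found
var match = url.match(/\/DevDOC[\s\S]*?\.pdf/i);
console.log(match && match[0]);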
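For Java, which the question asks about, a rough equivalent of the decode-then-match approach could look like the sketch below. It uses `java.net.URLDecoder`, which is a form decoder rather than an exact counterpart of `decodeURIComponent`: it also turns `+` into a space and throws `IllegalArgumentException` on a stray `%`, so the sketch falls back to the raw text in that case. The class name, method names and the percent-encoded sample URL are assumptions for illustration only; the URLs in the question contain literal spaces and would pass through the decoder unchanged.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class DecodedDevDocExtractor {

    // Case-insensitive, lazy match from "/DevDOC" to the nearest ".pdf".
    private static final Pattern DEVDOC_PDF =
            Pattern.compile("/DevDOC.*?\\.pdf", Pattern.CASE_INSENSITIVE);

    // Percent-decodes the text; returns it unchanged if it is not validly encoded.
    static String decodeLeniently(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            return raw; // e.g. a stray '%' that is not followed by two hex digits
        }
    }

    // Decodes first, then returns the first "/DevDOC....pdf" substring if present.
    static Optional<String> extract(String raw) {
        Matcher m = DEVDOC_PDF.matcher(decodeLeniently(raw));
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up percent-encoded variant of the question's URL.
        String encoded = "https://abcd.com/AllItems.aspx?id=%2Fsites%2FFunctionalSpecs%2FDevDOC"
                + "%2FEnhancements%20to%20PA%20Peer%20Checklist%2FPA%20Peer%20Checklist%20%28V2.3%29%20-v10.0.pdf";
        System.out.println(extract(encoded).orElse("no match"));
    }
}
```

If a file name can legitimately contain a `+`, decoding would change it to a space, so matching on the raw string may be the safer choice there.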