Improve performance of regular expression to match prefix/suffix only when middle is a match

I have a regular expression that is meant to identify whether a domain name appears in a body of text:

/\.domain\.com\/path/

If found, I want to also capture its full URL which may or may not exist. So I've added the following to the start/end:

/[^\s'"]*\.domain\.com\/path[^\s'"]*/ to capture things like https://subdomain.domain.com/path/path/?asdf#char

Is there a more performant way to do this, so that the document isn't scanned for [^\s'"] unless .domain.com/path is present?

As written, the expression is slow because of the [^\s'"] character class at the beginning and end. How could I improve the performance?

I chose \s and ' and " since whitespace and single/double quotes would indicate a URL string has started/ended.

CodePudding user response：

One thing to try is the "lazy" quantifier, *? instead of *. A greedy * first consumes every character that is not whitespace or a quote and then backtracks, while a lazy *? starts with as few characters as possible and only expands when the rest of the pattern fails to match.

So instead of:

 /[^\s'"]*\.domain\.com\/path[^\s'"]*/

you could use:

 /[^\s'"]*?\.domain\.com\/path[^\s'"]*/

Only the leading quantifier is made lazy here. A lazy quantifier at the very end of a pattern matches zero characters, so [^\s'"]*? after path would drop the rest of the URL. The trailing class has to stay greedy to run up to the next whitespace or quote. Whether the lazy prefix is actually faster depends on the regex engine and the input, because the engine still attempts a match at every starting position, so measure it on your own text.
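A minimal Python sketch comparing what the greedy and the fully lazy forms capture on the sample URL from the question. Python is an assumption here, since the thread does not name a language. Python patterns have no / delimiters, so the slashes are left unescaped, and the sample text and variable names are illustrative.

```python
import re

# Sample text (illustrative): the URL from the question inside double quotes.
text = 'see "https://subdomain.domain.com/path/path/?asdf#char" for details'

# Greedy on both sides: the trailing class runs up to the closing quote.
greedy = re.compile(r"""[^\s'"]*\.domain\.com/path[^\s'"]*""")

# Lazy on both sides: the trailing *? is the last item in the pattern,
# so it is satisfied by matching zero characters.
lazy = re.compile(r"""[^\s'"]*?\.domain\.com/path[^\s'"]*?""")

greedy_match = greedy.search(text).group(0)
lazy_match = lazy.search(text).group(0)

print(greedy_match)
print(lazy_match)
```

The greedy pattern returns the whole URL, while the fully lazy one stops right after /path, which is why a lazy quantifier in the final position is usually not what you want when capturing a URL.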

Another way to improve performance is to make use of positive lookaheads and lookbehinds. These constructs let you match a pattern without consuming any characters, so you can check for certain conditions before or after a match.

For example, you can use a positive lookbehind (?<=...) before the .domain.com/path match to check that it is preceded by a certain pattern, or a positive lookahead (?=...) after the match to check that it is followed by a certain pattern.

This way, the regular expression only matches the target pattern when it is preceded or followed by a specific pattern, which makes your search more specific and can reduce the number of positions where the engine attempts a full match.
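One way this could look, as a hedged Python sketch (the language, sample text, and variable names are assumptions, not from the question): a positive lookbehind restricts the start of a match to the beginning of the text or to a position right after whitespace or a quote, so the greedy prefix is only tried where a URL could begin. Any speedup depends on the engine and the input.

```python
import re

# Only start a match at the beginning of the text, or right after
# whitespace or a quote (positive lookbehind), i.e. where a URL can begin.
pattern = re.compile(
    r"""(?:^|(?<=[\s'"]))[^\s'"]*\.domain\.com/path[^\s'"]*"""
)

# Illustrative input: one unrelated URL and one URL on the target domain.
text = (
    "first https://other.example.org/x then "
    "'https://subdomain.domain.com/path/path/?asdf#char'"
)

# No capturing groups are used, so findall returns the full matches.
urls = pattern.findall(text)
print(urls)
```

Without the lookbehind, the engine retries the prefix from every character inside a long run of non-whitespace text. With it, each run is attempted once from its first character.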

CodePudding user response：

Well, assuming you expect each URL to begin with either http or https, you could use:

/https?:\/\/[^\s'"]*\.domain\.com\/path[^\s'"]*/

This regex only checks the domain of substrings that start like a bona fide URL.
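A short Python sketch of the scheme-anchored pattern (Python and the sample text are assumptions for illustration). It also shows the trade-off: a URL written without http:// or https:// is skipped.

```python
import re

# The literal scheme gives the engine a cheap starting point to look for
# before it tries the character classes.
pattern = re.compile(r"""https?://[^\s'"]*\.domain\.com/path[^\s'"]*""")

# Illustrative input: one scheme-less mention and one full URL.
text = (
    "plain subdomain.domain.com/path/a and "
    '"https://subdomain.domain.com/path/path/?asdf#char"'
)

matches = pattern.findall(text)
print(matches)
```

Only the URL with an explicit scheme is returned, so this approach fits when scheme-less mentions of the domain do not need to be captured.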