How To Combine These 2 Regexp in Javascript

Time:05-05

I wrote a JavaScript routine that finds the root domain of a given hostname or URL.

function getRootDomain(s){
  var sResult = ''
  try {
    sResult = s.match(/^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/).groups.domain
      .match(/(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/).groups.root;
  } catch(ignore) {}
  return sResult;
}

What is the technique to combine the two regex rules into one rule?

I used this tutorial to try to advance my existing RegExp experience over the years, although I've never really understood lookbehinds and lookaheads (which might be useful here?), and then used the great tool at RegEx101.com for trial and error. What I tried was to stick what's after <root> to replace what comes after <domain>, and variations on that, and all failed.

Answer:

The second regexp uses the $ assertion to only match the end of the .domain capture.

The first RegExp, however, stops matching after the domain: when it meets a /, a ?, a #, a :, or the end of the string if there is no path, query string or hash part. So you can't just reuse the $ assertion; it would fail whenever something follows the domain.
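
To make the problem concrete, here is a small sketch of what happens if the root pattern is simply pasted in with its $ left in place. The name `naive` and the sample URLs are hypothetical, chosen only for illustration.

```javascript
// Hypothetical naive merge: the root pattern is still anchored with $.
const naive = /^(?:.*?:\/\/?)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))$/;

// A bare hostname ends right after the TLD, so $ still works here.
const bareHost = naive.exec('www.example.com');

// With a path, $ anchors to the end of the URL, so the "root" is taken
// from the file name instead of the host.
const withPath = naive.exec('https://www.example.com/index.html');

// With a path that contains no dot, nothing can match before $ at all.
const noDotInPath = naive.exec('https://www.example.com/path');

console.log(bareHost && bareHost.groups.root);
console.log(withPath && withPath.groups.root);
console.log(noDotInPath);
```

The second and third cases are why the end-of-string assertion has to be replaced by something that also accepts the characters that can follow a host name.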

To combine both parts, you could replace the domain capture with this:

.*?(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)

(?:[\/?#:]|$) at the end is a non-capturing group that matches either one of the delimiter characters (/, ?, # or :) or the end of the string.

.*? lazily (non-greedily) matches anything. That is, the engine first tries to match the root capture followed by (?:[\/?#:]|$) at the current position. Every time that fails, .*? consumes one more character and the engine tries again, which lets it scan forward until it finds the root.
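
The scanning behaviour can be observed by wrapping the lazy prefix in a named group. This is only a sketch: the `skipped` group and the sample input are additions for illustration and are not part of the pattern above.

```javascript
// Same root pattern as above, with the lazy prefix captured as `skipped`
// (a hypothetical group added only to show what .*? ends up consuming).
const scan = /^(?<skipped>.*?)(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;

const m = scan.exec('www.shop.example.co.uk/cart');

// `skipped` holds the labels the engine had to step over before the
// root pattern could be followed by a delimiter or the end of the string.
console.log(m.groups.skipped);
console.log(m.groups.root);
```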

Also:

  • you can combine \.\w{3,}|\.\w{2} into just \.\w{2,}.

  • you can use a non-capturing group around the TLDs: (?:...) instead of (...).

  • It would be better to use a lazy .*? to match the protocol, or you could end up consuming too much (with a greedy .*, passing https://example.com/#://bar.com would return bar.com).

  • You don't need to escape the :. In Unicode mode (the u flag), that escape is actually a syntax error.
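
Two of these points are easy to check directly. The snippet below is a minimal sketch: `greedy` and `lazy` are hypothetical names for the question's first pattern with and without the lazy quantifier, and the URL is the one from the bullet above.

```javascript
// First pattern from the question (greedy) and the same pattern with .*?
const greedy = /^(?:.*:\/?\/)?(?<domain>[\w\-.]*)/;
const lazy = /^(?:.*?:\/?\/)?(?<domain>[\w\-.]*)/;

const url = 'https://example.com/#://bar.com';

// Greedy .* runs to the last ":/" it can find, inside the hash part.
const greedyDomain = greedy.exec(url).groups.domain;
// Lazy .*? stops at the first one, right after the scheme.
const lazyDomain = lazy.exec(url).groups.domain;

console.log(greedyDomain, lazyDomain);

// Escaping ":" is harmless without the u flag but rejected with it.
let unicodeEscapeError = null;
try {
  new RegExp('\\:', 'u');
} catch (e) {
  unicodeEscapeError = e;
}
console.log(unicodeEscapeError instanceof SyntaxError);
```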

Putting it all together, the combined pattern is:

const x = /^(?:.*?:\/\/?)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/
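
As a sketch of how this could replace the two-step version in the question, the pattern can be dropped into the original function. Returning an empty string when nothing matches mirrors the question's try/catch behaviour; the constant name `ROOT_DOMAIN` and the sample inputs are assumptions made for this example.

```javascript
// Combined pattern from above; ROOT_DOMAIN is a hypothetical name.
const ROOT_DOMAIN = /^(?:.*?:\/\/?)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;

function getRootDomain(s) {
  // exec returns null when there is no match, so no try/catch is needed.
  const m = ROOT_DOMAIN.exec(s);
  return m ? m.groups.root : '';
}

console.log(getRootDomain('https://www.example.com/path?x=1'));
console.log(getRootDomain('sub.example.co.uk'));
console.log(getRootDomain('http://example.com:8080/#top'));
console.log(getRootDomain('https://example.com/#://bar.com'));
```

Like the original, this relies on the shape of the name rather than on a list of real public suffixes, so it only recognises two-letter second-level suffixes such as co.uk.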

I actually wrote a RegExp builder that may help you get further in your RegExp learning journey... Here's your RegExp ported to compose-regexp
