Regexp to extract the components of a URL

Time:04-11

I am trying to write a regexp to extract the components of a URL. The syntax is defined in RFC 3986.

Some of the components are optional. So far I have:

(.+)://((.*)@)?(.+?)(:(\d*))?/((.*)\?)?((.*)#)?(.*)

The decomposition is:

  • (.+):// matches the scheme followed by ://. Not optional.
  • ((.*)@)? matches the user information part of the authority. Optional.
  • (.+?) matches the host. Not optional. There is an issue here: this group can also match the optional port.
  • (:(\d*))? should match the port. Optional.
  • / this and all that follows should be made optional.
  • ((.*)\?)? matches the path part. Optional.
  • ((.*)#)? matches the query part. Optional.
  • (.*) matches the fragment part. Optional.
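
To check the decomposition, here is a minimal JavaScript sketch that runs the pattern, written with `+` quantifiers as `(.+)` and `(.+?)`, against the RFC's example URL. The helper name `decompose` and the mapping from group numbers to component names are my own labels for illustration.

```javascript
// The pattern from the question as a regex literal, so each "/" is escaped as "\/".
const urlPattern = /(.+):\/\/((.*)@)?(.+?)(:(\d*))?\/((.*)\?)?((.*)#)?(.*)/;

// Hypothetical helper: maps the numbered capture groups to named components.
function decompose(url) {
  const m = url.match(urlPattern);
  if (m === null) return null;
  return {
    scheme: m[1],
    userinfo: m[3],
    host: m[4],
    port: m[6],
    path: m[8],
    query: m[10],
    fragment: m[11],
  };
}

console.log(decompose("foo://example.com:8042/over/there?name=ferret#nose"));
console.log(decompose("foo://example.com")); // no "/" after the host
```

On the first input the lazy `(.+?)` stops at `example.com` and the port lands in its own group. The second input does not match at all, because the pattern requires a `/` after the host.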

How can I improve this regexp so that it is valid according to RFC 3986?

Fun fact: this regexp matches itself.

Example URL (taken from the RFC): foo://example.com:8042/over/there?name=ferret#nose

Edit: I forgot to escape d. Now all that's left to do is to make everything that follows the host optional, including the leading /.
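
One way to make everything after the host optional is to wrap each trailing component, delimiter included, in its own optional non-capturing group, and to use negated character classes instead of `.*` so that a component cannot run past its delimiter. The following is a sketch under those assumptions, not a full RFC 3986 validator: the names `looseUrl` and `parseLoose` are made up for illustration, and bracketed IPv6 hosts are not handled.

```javascript
// scheme "://" [userinfo "@"] host [":" port] ["/" path] ["?" query] ["#" fragment]
const looseUrl =
  /^([^:\/?#]+):\/\/(?:([^@\/?#]*)@)?([^:\/?#]+)(?::(\d*))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;

// Hypothetical helper: returns the named components, or null when there is no match.
// Components that are absent from the input come back as undefined.
function parseLoose(url) {
  const m = looseUrl.exec(url);
  if (m === null) return null;
  const [, scheme, userinfo, host, port, path, query, fragment] = m;
  return { scheme, userinfo, host, port, path, query, fragment };
}

console.log(parseLoose("foo://example.com:8042/over/there?name=ferret#nose"));
console.log(parseLoose("foo://example.com"));
console.log(parseLoose("www.example.com"));
```

Because the leading `/` sits inside the optional path group, `foo://example.com` now matches with only the scheme and host set, while `www.example.com` still fails for lack of a scheme.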

CodePudding user response:

Your regular expression works fine if you just escape the slashes and, preferably, the colon as well. The result is (.+)\:\/\/(.*@)?(.+?)(:(\d*))?\/((.*)\?)?((.*)#)?(.*). Here is a simple script to show how it can be used to filter out invalid URIs:

Update: Following the comments, I have made the following modification:

  • I have added (\:((\d*)\/))?(\/)*. Explanation:
    • \:(\d*) matches a colon followed by any string of digits.
    • The \/ after this matches the slash that must follow the string of digits. The port may contain only digits, so no other characters can appear in the port portion of the URI.
    • The entire port-matching expression is optional, hence the ?.
    • The last part, (\/)*, indicates that zero or more slashes can follow the port, whether or not a port is present.

Final regExp: (.+)\:\/\/(.*\@)?(.+?)(\:((\d*)\/))?(\/)*((.*)\?)?((.*)\#)?(.*)

// A regex literal is used instead of new RegExp("..."): inside a string literal,
// escapes such as \d and \? lose their backslash before the pattern is compiled.
// No "g" flag, because a global regex keeps state (lastIndex) between test() calls.
const myRegEx = /(.+)\:\/\/(.*\@)?(.+?)(\:((\d*)\/))?(\/)*((.*)\?)?((.*)\#)?(.*)/;

const allUris = [
  /*Valid*/ "https://[email protected]:5050/page?query=value#element",
  /*Valid*/ "foo://example.com:8042/over/there?name=ferret#nose",
  /*Valid*/ "foo://example.com",
  /*Not valid*/ "www.example.com"];

// Keep only the URIs that the regexp matches
const allowedUris = allUris.filter(uri => myRegEx.test(uri));

console.log("Here are the valid URIs:");
console.log(allowedUris.join("\n\n")); // Should print only the first three URIs from the array.
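
For comparison, RFC 3986 itself gives a reference pattern in Appendix B that splits a URI reference into scheme, authority, path, query and fragment. It does not validate anything, and it leaves the authority in one piece. The sketch below uses that pattern and then splits the authority with a second expression; that second expression and the function name `splitUri` are my own assumptions, not taken from the RFC.

```javascript
// Pattern from RFC 3986, Appendix B (slashes escaped for a JavaScript regex literal).
const rfc3986 = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Assumed helper pattern (not from the RFC): [userinfo "@"] host [":" port]
const authorityParts = /^(?:([^@]*)@)?(.*?)(?::(\d*))?$/;

// Hypothetical helper: every component is optional, so absent ones are undefined.
function splitUri(uri) {
  const m = rfc3986.exec(uri); // always matches, since every group is optional
  const result = {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
  if (m[4] !== undefined) {
    const a = authorityParts.exec(m[4]);
    if (a !== null) {
      result.userinfo = a[1];
      result.host = a[2];
      result.port = a[3];
    }
  }
  return result;
}

console.log(splitUri("foo://example.com:8042/over/there?name=ferret#nose"));
console.log(splitUri("www.example.com"));
```

Note that the Appendix B pattern also accepts relative references: `www.example.com` comes back with no scheme and the whole string as the path, so rejecting such input requires an explicit check that the scheme is present.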
