I am trying to write a regexp to extract the components of a URL. The syntax can be found here: RFC 3986.
Some of the components are optional. So far I have:
(.+)://((.*)@)?(.+?)(:(\d*))?/((.*)\?)?((.*)#)?(.*)
The decomposition is:
- (.+):// matches the scheme followed by ://. Not optional.
- ((.*)@)? matches the user information part of the authority. Optional.
- (.+?) matches the host. Not optional. There is an issue here where this group will also match the optional port.
- (:(\d*))? should match the port.
- / this and all that follows should be made optional.
- ((.*)\?)? matches the path part. Optional.
- ((.*)#)? matches the query part. Optional.
- (.*) matches the fragment part. Optional.
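To make the group numbering concrete, here is a minimal sketch that runs the expression against the example URL from the RFC. It assumes the quantifiers are + and +? (the plus signs appear to have been lost in formatting), and the slashes are escaped only because a JavaScript regex literal requires it. The names pattern and components are illustrative.

```javascript
// Assumption: the scheme and host groups use "+" and "+?" as quantifiers.
const pattern = /(.+):\/\/((.*)@)?(.+?)(:(\d*))?\/((.*)\?)?((.*)#)?(.*)/;

const m = pattern.exec("foo://example.com:8042/over/there?name=ferret#nose");

// Map the numbered capture groups to the components described in the decomposition.
const components = {
  scheme: m[1],
  userinfo: m[3], // undefined when there is no user information
  host: m[4],
  port: m[6],
  path: m[8],
  query: m[10],
  fragment: m[11],
};

console.log(components);
```

For this particular URL the groups come out as foo, example.com, 8042, over/there, name=ferret and nose. Note that the path loses its leading slash, and that this check says nothing about URLs without a path.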
How can I improve this regexp so that it is RFC 3986-valid?
Fun fact: this regexp matches itself.
Example URL (taken from the RFC): foo://example.com:8042/over/there?name=ferret#nose
Edit: I forgot to escape d. Now all that's left to do is to make everything that follows the host optional, including the leading /.
CodePudding user response:
Your regular expression works fine if you just escape the slashes and preferably the colon as well. The result is (.+)\:\/\/(.*@)?(.+?)(:(\d*))?\/((.*)\?)?((.*)#)?(.*). Here is a simple script to show how it can be used to filter out invalid URIs:
Update: Following the comments, I have made the following modification:
- I have added (\:((\d*)\/))?(\/)*. Explanation:
  - \:((\d*) matches a colon and then any string of digits.
  - The \/ after this matches a slash, which should follow the string of digits. This is because the port must not contain any characters other than digits, so no other characters can appear in the port portion of the URI.
  - Finally, the entire port-matching expression is optional, hence the ?.
  - The last part indicates that any number of slashes (including none) can follow the port, whether or not the port is present.
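The behaviour of the added port fragment can be checked on its own. The sketch below anchors it with ^ and $ (an addition made here purely so that it can be tested in isolation) and tries a few sample strings; the name portPart and the samples are illustrative.

```javascript
// The port fragment, anchored so that it must match the whole sample string.
const portPart = /^(\:((\d*)\/))?(\/)*$/;

// Group 3 holds the digits of the port when the optional port group takes part in the match.
for (const sample of [":8042/", "", "///", ":80a/", ":8042"]) {
  const m = portPart.exec(sample);
  console.log(JSON.stringify(sample), m ? `matches, port = ${m[3]}` : "no match");
}
```

The last two samples do not match: a non-digit inside the port is rejected, and so is a port that is not followed by a slash, since the slash is part of the optional group.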
Final regExp:
(.+)\:\/\/(.*\@)?(.+?)(\:((\d*)\/))?(\/)*((.*)\?)?((.*)\#)?(.*)
// A regex literal keeps the backslashes; inside a string passed to new RegExp, "\d" would collapse to "d".
const myRegEx = /(.+)\:\/\/(.*\@)?(.+?)(\:((\d*)\/))?(\/)*((.*)\?)?((.*)\#)?(.*)/;
const allUris = [
  /*Valid*/ "https://[email protected]:5050/page?query=value#element",
  /*Valid*/ "foo://example.com:8042/over/there?name=ferret#nose",
  /*Valid*/ "foo://example.com",
  /*Not valid*/ "www.example.com"];
// Keep only the URIs that the regexp matches
const allowedUris = allUris.filter(uri => myRegEx.test(uri));
console.log("Here are the valid URIs:");
console.log(allowedUris.join("\n\n")); // Should print only the first three URIs from the array.
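For splitting a URI into its components rather than filtering a list, RFC 3986 itself gives a reference regular expression in Appendix B. A minimal sketch built on it follows. The splitUri helper and the authorityPattern expression are illustrative assumptions, not part of the RFC, and the authority split does not handle bracketed IPv6 hosts.

```javascript
// Reference expression from RFC 3986, Appendix B (slashes escaped for a JS regex literal).
// Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
const uriPattern = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper pattern: userinfo "@" host ":" port, with userinfo and port optional.
// Assumes a registered name or IPv4 host; IPv6 literals such as [::1] are not covered.
const authorityPattern = /^(?:([^@]*)@)?([^:]*)(?::(\d*))?$/;

function splitUri(uri) {
  // Every part of uriPattern is optional, so exec always returns a match here.
  const m = uriPattern.exec(uri);
  const parts = { scheme: m[2], authority: m[4], path: m[5], query: m[7], fragment: m[9] };
  if (parts.authority !== undefined) {
    const a = authorityPattern.exec(parts.authority);
    if (a) {
      parts.userinfo = a[1];
      parts.host = a[2];
      parts.port = a[3];
    }
  }
  return parts;
}

console.log(splitUri("foo://example.com:8042/over/there?name=ferret#nose"));
```

Because every group in the Appendix B expression is optional, it decomposes a string rather than validating it: a string such as www.example.com simply ends up entirely in the path, with no scheme or authority.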