How to make a regex case-insensitive for a middle part of the expression?

So I am trying to learn regular expressions. I am using a website that provided me with the code for a URL checker, looking like this:

/^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/gm

The site also provided some test URLs; the regex matches the following URLs:

https://www.google.com

http://www.google.com

www.google.com

but not:

htt://www.google.com

://www.google.com

which is correct.

However, it does not mark www.Google.com because of the capital G.

I am aware I can just use [A-Za-z0-9] in the second pair of square brackets and it works fine, but I am wondering if there is a way to do this with the i flag and/or the (?...) syntax instead, so that only the middle part of the URL (Google) is case-insensitive while everything else remains case-sensitive. Thanks!

CodePudding user response：

Yes, you can use the i flag to make the regular expression case-insensitive. The i flag can be placed at the end of the regular expression, like so:

/^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/gmi

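This makes the entire regular expression case-insensitive, which means that it will match both "Google" and "google" in the middle of the URL, but also upper case in the scheme and the TLD. If you want only the middle part of the URL (the domain name) to be case-insensitive, wrap that part in a scoped inline modifier group, (?i:...), in regex flavors that support it. A bare (?i) typically applies to the rest of the pattern or the enclosing group rather than to one chosen part.

With the scoped group, only the domain name is case-insensitive while the rest of the regular expression remains case-sensitive.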
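
To see why the global i flag is broader than what the question asks for, here is a minimal sketch in JavaScript. It reuses the question's pattern without the g and m flags, so that repeated test() calls do not depend on lastIndex. The names strict, loose, samples and results and the sample inputs are only illustrative.

```javascript
// The question's pattern, case-sensitive (g and m flags dropped for simple test() calls).
const strict = /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

// Same source, but with the i flag applied to the whole expression.
const loose = new RegExp(strict.source, 'i');

// Illustrative inputs: upper case in the domain, in the scheme, and in the TLD.
const samples = ['www.Google.com', 'HTTPS://www.google.com', 'www.google.COM'];

const results = samples.map(str => ({
  str,
  strict: strict.test(str),
  loose: loose.test(str),
}));

results.forEach(r => console.log(r.str, '=> strict:', r.strict, ', loose:', r.loose));
```

All three inputs fail the case-sensitive pattern and pass once the i flag is set, so the flag relaxes the scheme and the TLD as well as the domain name.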

CodePudding user response：

In many regex implementations (Python, C#, Java, Perl, Ruby, R, Swift, Go) you can use the (?i:foo) pattern to ignore case just for foo. This is the commented-out regex0 below.

Unfortunately, JavaScript has historically not supported turning case sensitivity on or off within a regex; scoped modifiers such as (?i:...) were only standardized in ECMAScript 2025 and are missing from older engines. To ignore case selectively in a portable way, you therefore need to use the pattern [A-Za-z0-9] instead of [a-z0-9].

//const regex0 = /^(?:https?:\/\/(?:www\.)?)?(?i:[a-z0-9]+(?:[\-\.][a-z0-9]+)*)\.[a-z]{2,63}(?::[0-9]{1,5})?(?:\/.*)?$/;
const regex1 = /^(?:https?:\/\/(?:www\.)?)?[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,63}(?::[0-9]{1,5})?(?:\/.*)?$/;
const regex2 = /^(?:https?:\/\/(?:www\.)?)?[A-Za-z0-9]+(?:[\-\.][A-Za-z0-9]+)*\.[a-z]{2,63}(?::[0-9]{1,5})?(?:\/.*)?$/;
[
  'https://www.google.com',
  'https://www.Google.com',
  'http://www.google.com',
  'www.google.com',
  'htt://www.google.com',
  '://www.google.com'
].forEach(str => console.log(str, '=> regex1:', regex1.test(str), ', regex2:', regex2.test(str)));

Output:

https://www.google.com => regex1: true , regex2: true
https://www.Google.com => regex1: false , regex2: true
http://www.google.com => regex1: true , regex2: true
www.google.com => regex1: true , regex2: true
htt://www.google.com => regex1: false , regex2: false
://www.google.com => regex1: false , regex2: false
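
If the same code has to run on engines with and without scoped modifiers, one possible approach is to probe for support at runtime and fall back to the widened character class. The following is only a sketch under that assumption: supportsScopedModifiers, prefix, host, suffix and urlRegex are hypothetical names, and the pattern pieces are taken from regex0 and regex2.

```javascript
// Hypothetical helper: true if this engine accepts and honors (?i:...) groups.
// Engines without modifier support throw a SyntaxError from the RegExp constructor.
function supportsScopedModifiers() {
  try {
    return new RegExp('^(?i:a)$').test('A');
  } catch (e) {
    return false;
  }
}

// Case-sensitive parts of the pattern (scheme, TLD, port, path).
const prefix = String.raw`^(?:https?:\/\/(?:www\.)?)?`;
const suffix = String.raw`\.[a-z]{2,63}(?::[0-9]{1,5})?(?:\/.*)?$`;

// Case-insensitive middle part: scoped modifier if available, wider classes otherwise.
const host = supportsScopedModifiers()
  ? String.raw`(?i:[a-z0-9]+(?:[\-\.][a-z0-9]+)*)`
  : String.raw`[A-Za-z0-9]+(?:[\-\.][A-Za-z0-9]+)*`;

const urlRegex = new RegExp(prefix + host + suffix);

['www.Google.com', 'HTTPS://www.google.com', 'www.google.COM']
  .forEach(str => console.log(str, '=>', urlRegex.test(str)));
```

Both branches are meant to accept upper case in the domain name only, so the result should not depend on which branch the engine takes.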

Notes on regex

simplified the optional http://, https://, http://www., https://www. with nested non-capture groups
[\-\.]{1} and [\-\.] are identical, so use the simple version
TLD max length is 63 chars, not 5
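
The last two notes can be checked with a short sketch. The names shortTld, longTld, url, withCount, withoutCount and sameResult are illustrative, and www.example.photography is just a placeholder hostname with a TLD longer than five letters.

```javascript
// TLD limited to 5 letters, as in the original pattern.
const shortTld = /^(?:https?:\/\/(?:www\.)?)?[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/.*)?$/;
// TLD of up to 63 letters, as in regex1.
const longTld = /^(?:https?:\/\/(?:www\.)?)?[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,63}(?::[0-9]{1,5})?(?:\/.*)?$/;

const url = 'https://www.example.photography';
console.log('shortTld:', shortTld.test(url), ', longTld:', longTld.test(url));

// {1} repeats the preceding item exactly once, so both forms accept the same strings.
const withCount = /^[\-\.]{1}$/;
const withoutCount = /^[\-\.]$/;
const sameResult = ['-', '.', '--', 'a']
  .every(s => withCount.test(s) === withoutCount.test(s));
console.log('same result:', sameResult);
```

The {2,5} version rejects the long TLD while the {2,63} version accepts it, and the two separator classes agree on every input tried here.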

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex