Regex for URLs outside square bracket in text PHP

Time:08-09

I have this:

[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)

It matches for:

www.example.com 
http://example.com.nz 
example.com
http://www.example.com?2rjl6
example.com/first/second
https://example.us.edi?34535/534534?dfg=g&fg

etc...

I want no match if any of the above URLs are enclosed in square brackets [ ] like this:

[www.example.com]
[http://example.com.nz]
etc...

The text is long and may or may not contain more than one URL, spaces, line breaks, and so on.

e.g.

Lorem ipsum dolor sit amet, consectetur [http://example.com.nz] llamcorper et lacus. Morbi sodales convallis lectus a efficitur: example.com/first/second vitae nisl placerat.

Fusce non ipsum a augue http://example.com.nz http://www.example.com?2rjl6 aculis augue. Nullam eu nulla lectus.

In this case there should be only 3 matches.

I tried adding:

(?![^\[]*\])

But it doesn't work as expected.

Can you help me with this or recommend another approach? Thanks.

CodePudding user response:

You can match everything from an opening to a closing square bracket and then discard that match with the (*SKIP)(*FAIL) verbs, which PHP's PCRE engine supports.
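To make the idea concrete, here is a minimal sketch that uses plain words instead of URLs. The sample string and the variable names are made up for illustration; only the bracket-skipping part is taken from the pattern in this answer.

```php
<?php
// Made-up sample text: one bracketed part that should be ignored.
$text = 'keep [drop this] keep2';

// Without the skip, words inside the brackets are matched as well.
preg_match_all('~\w+~', $text, $plain);

// With the skip, the bracketed part is consumed first and then thrown away,
// and the next match attempt starts after the closing bracket.
preg_match_all('~\[[^][]*](*SKIP)(*F)|\w+~', $text, $skipped);

print_r($plain[0]);   // keep, drop, this, keep2
print_r($skipped[0]); // keep, keep2
```

The same alternation works with any pattern on the right-hand side of the |, which is how it is combined with the URL pattern here.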

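You might also shorten the pattern a bit. The whole first part is inside a character class, [(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=], so the parentheses and the letters of http and www are matched as individual characters rather than as groups. You can move the opening square bracket to just before a-z.

And since a-zA-Z0-9_ can be written as \w, you can shorten the character class a bit so that it starts with [\w

If you choose a delimiter other than /, such as ~, you don't have to escape the forward slashes. The delimiter you pick must then be escaped wherever it occurs inside the pattern.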
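As a small illustration of the last two points, the sketch below checks the same URL prefix twice. The test string and variable names are assumptions for the example, not part of the original question.

```php
<?php
$url = 'http://example.com';

// With "/" as the delimiter, every literal forward slash needs a backslash.
$withSlashDelimiter = preg_match('/^https?:\/\/[a-zA-Z0-9_]+/', $url);

// With "~" as the delimiter the slashes stay as they are,
// and \w stands in for a-zA-Z0-9_.
$withTildeDelimiter = preg_match('~^https?://\w+~', $url);

var_dump($withSlashDelimiter === 1 && $withTildeDelimiter === 1);
```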

\[[^][]*](*SKIP)(*F)|(?:https?://)?(?:www\.)?[\w@:%.+~#=]{2,256}\.[a-z]{2,6}\b[\w-@:%+.~#?&/=]*

Explanation

  • \[[^][]*] Match from [ to ] with no square brackets in between
  • (*SKIP)(*F) Fail the match and skip the text consumed so far, so nothing inside the brackets can be matched
  • | Or
  • (?:https?://)? Optionally match the protocol
  • (?:www\.)? Optionally match www.
  • [\w@:%.+~#=]{2,256} Repeat 2-256 times any of the characters listed in the character class
  • \.[a-z]{2,6}\b Match a dot and 2-6 chars a-z followed by a word boundary
  • [\w-@:%+.~#?&/=]* Optionally match any of the characters listed in the character class
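Putting it together in PHP could look roughly like the sketch below. The function name findUrlsOutsideBrackets is hypothetical, and the sample text is a shortened version of the one in the question. Two deliberate deviations from the pattern above: the delimiter is ! because /, ~ and # all occur in the pattern, and the literal hyphen is moved to the end of the last character class so that it cannot be read as a range.

```php
<?php
// Hypothetical helper: returns every URL-like match that is not inside [...].
function findUrlsOutsideBrackets(string $text): array
{
    $pattern = '!\[[^][]*](*SKIP)(*F)'              // consume [...] and discard it
             . '|(?:https?://)?(?:www\.)?'          // optional protocol and www.
             . '[\w@:%.+~#=]{2,256}\.[a-z]{2,6}\b'  // name, dot, 2-6 lowercase letters
             . '[\w@:%+.~#?&/=-]*!';                // optional path and query

    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

// Shortened sample text based on the question.
$text = 'Lorem ipsum [http://example.com.nz] efficitur: example.com/first/second vitae. '
      . 'Fusce http://example.com.nz http://www.example.com?2rjl6 aculis augue.';

$urls = findUrlsOutsideBrackets($text);
print_r($urls);
```

For this sample text the result contains three entries: example.com/first/second, http://example.com.nz and http://www.example.com?2rjl6. The bracketed URL is not reported.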
