I have this:
[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)
It matches:
www.example.com
http://example.com.nz
example.com
http://www.example.com?2rjl6
example.com/first/second
https://example.us.edi?34535/534534?dfg=g&fg
etc...
I want no match if any of the above URLs are enclosed in square brackets [ ] like this:
[www.example.com]
[http://example.com.nz]
etc...
The text is long and may or may not contain more than one URL, spaces, line breaks, and so on.
e.g.
Lorem ipsum dolor sit amet, consectetur [http://example.com.nz] llamcorper et lacus. Morbi sodales convallis lectus a efficitur: example.com/first/second vitae nisl placerat.
Fusce non ipsum a augue http://example.com.nz http://www.example.com?2rjl6 aculis augue. Nullam eu nulla lectus.
In this case there should be only 3 matches.
I tried adding:
(?![^\[]*\])
But it doesn't work as expected.
Can you help me with this or recommend another approach? Thanks.
Answer:
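You can match everything from an opening square bracket to the closing one and then discard that match with the (*SKIP)(*F) backtracking verbs, which PHP's PCRE engine supports.
You can also shorten the pattern a bit. The whole first part currently sits inside a character class, [(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=], so the protocol and www parts are treated as a set of single characters rather than as optional groups. Move the opening square bracket to just before a-z so that those groups stand outside the class.
Since a-zA-Z0-9_ can be written as \w, the character class can then start with [\w.
If you choose a delimiter other than /, such as ~, you don't have to escape the forward slashes. Because this pattern also contains a literal ~, that delimiter would have to be escaped as \~ inside the character classes; a delimiter that does not occur in the pattern avoids both.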
\[[^][]*](*SKIP)(*F)|(?:https?://)?(?:www\.)?[\w@:%.+~#=]{2,256}\.[a-z]{2,6}\b[\w@:%+.~#?&/=-]*
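As a rough sketch of how this could be run against the sample text from the question, the snippet below uses preg_match_all. The ! delimiter and the variable names are my own choices, not part of the original answer. The character classes here use + and place the hyphen last, because PCRE2 (PHP 7.3 and later) rejects a hyphen directly after \w as an invalid range.

```php
<?php
// Sample text from the question: one bracketed URL and three bare ones.
$text = "Lorem ipsum dolor sit amet, consectetur [http://example.com.nz] llamcorper et lacus. "
      . "Morbi sodales convallis lectus a efficitur: example.com/first/second vitae nisl placerat.\n"
      . "Fusce non ipsum a augue http://example.com.nz http://www.example.com?2rjl6 aculis augue. "
      . "Nullam eu nulla lectus.";

// "!" is used as the delimiter (an assumption) because it does not occur in
// the pattern, so neither "/" nor "~" needs escaping.
$pattern = '!\[[^][]*](*SKIP)(*F)|(?:https?://)?(?:www\.)?[\w@:%.+~#=]{2,256}\.[a-z]{2,6}\b[\w@:%+.~#?&/=-]*!';

// $matches[0] receives every full match; the bracketed URL is skipped.
preg_match_all($pattern, $text, $matches);

print_r($matches[0]);
// Expected: example.com/first/second, http://example.com.nz,
//           http://www.example.com?2rjl6
```

The bracketed [http://example.com.nz] is consumed by the first alternative and thrown away, which leaves the 3 matches the question asks for.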
Explanation
\[[^][]*]
Match from [ to ]
(*SKIP)(*F)
Skip the match
|
Or
(?:https?://)?
Optionally match the protocol
(?:www\.)?
Optionally match www.
[\w@:%.+~#=]{2,256}
Repeat 2-256 times any of the characters listed in the character class
\.[a-z]{2,6}\b
Match a dot and 2-6 chars a-z followed by a word boundary
[\w@:%+.~#?&/=-]*
Optionally match any of the characters listed in the character class
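The breakdown above can also be kept next to the pattern itself. A minimal sketch, assuming the x (extended) flag and a ! delimiter, neither of which appears in the original answer: in extended mode the engine ignores whitespace and # comments outside character classes, so each part can carry its own note. The short test string is made up for illustration.

```php
<?php
// Annotated form of the pattern. Inside a character class "#" stays literal,
// so the classes behave exactly as in the one-line version.
$annotated = '!
    \[ [^][]* ]                # a bracketed run, from [ to ]
    (*SKIP)(*F)                # discard it and resume scanning after the ]
  |                            # or
    (?:https?://)?             # optional protocol
    (?:www\.)?                 # optional www.
    [\w@:%.+~#=]{2,256}        # 2 to 256 characters from this class
    \.[a-z]{2,6}\b             # a dot, 2 to 6 chars a-z, then a word boundary
    [\w@:%+.~#?&/=-]*          # optional trailing path or query characters
!x';

// Hypothetical input: one bracketed URL, one bare URL.
$sample = 'see [www.example.com] and www.example.org/a';

preg_match_all($annotated, $sample, $found);

print_r($found[0]); // only www.example.org/a is reported
```

The comments avoid the ! character on purpose, since PHP would otherwise treat it as the closing delimiter.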