Regex to find a specific anchor tag that have href with a specific domain and nofollow

Time:01-11

I have a string that contains HTML. I want a regex that finds the anchor tags whose href points to a specific domain name and that also have rel="nofollow".

I have found this one, which works for the domain name but does not include the nofollow condition:

(<a\s*(?!.\brel=)[^>])(href="https?://)((?stackoverflow)[^"]+)"([^>]*)>

Let's say the domain name I want is stackoverflow. Examples:

- "<a href="stackoverflow.com" rel = "nofollow">click here </a>" would match
- "<a href="stackoverflow.com">" would not match, since it has no nofollow
- "<a href="google.com" rel = "nofollow">" would not match, since the domain is different

CodePudding user response:

It's a bit hard to match an HTML tag with specific conditions, but the following regex should do it:

select regexp_match(str, '<a((?:\s+(([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))* href=((''(https?:\/\/)?stackoverflow\.com[^'']*'')|("(https?:\/\/)?stackoverflow\.com[^"]*"))((?: (([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))*\s+rel=("nofollow"|''nofollow'')((?: (([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))*\/?>') from tes;

It's really hard to read, but most of the regex is only there to match arbitrary attributes. The important part for you is stackoverflow\.com, which appears twice: once for an href in single quotes and once for an href in double quotes. Replace both occurrences with whatever domain you need, and don't forget to escape it properly (for example, the dot as \.).
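As a rough illustration of the substitution described above, here is a sketch in Python (chosen only for illustration; the query above is PostgreSQL). It builds a simplified version of the same pattern for an arbitrary domain and lets re.escape do the escaping. The names ATTR, REL and build_pattern are hypothetical, and the pattern uses \s+ between attributes rather than the literal spaces of the original, so treat it as an approximation rather than an exact port.

```python
import re

# One generic attribute: a name, optionally followed by =value
# (single-quoted, double-quoted, or unquoted). Hypothetical helper.
ATTR = r"""[^/='"<>\s]+(?:=(?:'[^']*'|"[^"]*"|[^\s<>'"=`]+))?"""

# rel="nofollow" value in either quote style.
REL = r"""(?:"nofollow"|'nofollow')"""


def build_pattern(domain: str) -> re.Pattern:
    """Compile a pattern for <a ... href=<domain>... ... rel=nofollow ...>."""
    d = re.escape(domain)  # "stackoverflow.com" becomes "stackoverflow\.com"
    url_sq = rf"'(?:https?://)?{d}[^']*'"  # href value in single quotes
    url_dq = rf'"(?:https?://)?{d}[^"]*"'  # href value in double quotes
    return re.compile(
        rf"<a(?:\s+{ATTR})*\s+href=(?:{url_sq}|{url_dq})"
        rf"(?:\s+{ATTR})*\s+rel={REL}"
        rf"(?:\s+{ATTR})*\s*/?>"
    )


pattern = build_pattern("stackoverflow.com")
html = '<p><a href="stackoverflow.com" rel="nofollow">click here </a></p>'
match = pattern.search(html)
print(match.group(0) if match else None)
```

Like the original query, this sketch expects href to come before rel and does not accept whitespace around the = sign.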

Some notes

I don't know which regexp function you want to use, but the pattern should work with whatever regexp function you need.

Another thing is that your example <a href="stackoverflow.com" rel = "nofollow">click here </a> won't be matched, because you have spaces between the attribute name and the = sign (I don't know whether this is valid HTML or not). It will work with <a href="stackoverflow.com" rel="nofollow">click here </a>. If you need to match tags that may include spaces around the = sign, just leave a comment and I'll try to edit the regex.
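For the spaces-around-= case mentioned above, one possible adjustment is to allow optional whitespace on both sides of each = sign with \s*=\s*. The Python sketch below is only an assumption about how such an edit might look, not the answerer's own follow-up; the names ATTR and NOFOLLOW_LINK are hypothetical, and the pattern is written with re.VERBOSE purely for readability.

```python
import re

# Generic attribute, now tolerating whitespace around "=". Hypothetical helper.
ATTR = r"""[^/='"<>\s]+(?:\s*=\s*(?:'[^']*'|"[^"]*"|[^\s<>'"=`]+))?"""

# re.VERBOSE ignores the layout whitespace inside the pattern below.
NOFOLLOW_LINK = re.compile(
    rf"""<a(?:\s+{ATTR})*
        \s+href\s*=\s*(?:'(?:https?://)?stackoverflow\.com[^']*'
                        |"(?:https?://)?stackoverflow\.com[^"]*")
        (?:\s+{ATTR})*
        \s+rel\s*=\s*(?:"nofollow"|'nofollow')
        (?:\s+{ATTR})*\s*/?>""",
    re.VERBOSE,
)

samples = [
    '<a href="stackoverflow.com" rel = "nofollow">click here </a>',
    '<a href="stackoverflow.com">',
    '<a href="google.com" rel = "nofollow">',
]
for s in samples:
    # Report whether each of the question's examples is matched.
    print(bool(NOFOLLOW_LINK.search(s)), s)
```

As in the original query, href still has to appear before rel for a tag to match.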
