Regex to find anchor tag not working accurately

I have the following regex to find the anchor tag whose anchor text is 'Kontakt':

#<a.*href="[^"]*".*>Kontakt<\/a>#

Here is the string to search:

<li ><a href="/webdesign-tipps" title="Wissenswertes zu Webdesign, Grafikdesign oder Onlinemarketing">Wissenswertes</a></li><li ><a href="/webagentur" >Webagentur</a></li><li ><a href="/team" >Team</a></li><li ><a href="/support" >Support<span ></span></a></li><li ><a href="/jobs" >Jobs</a></li><li ><a href="/kontakt" >Kontakt</a></li></ul>

So the result should be:

<a href="/kontakt" >Kontakt</a>

But the result I get is:

<a href="/webdesign-tipps" title="Wissenswertes zu Webdesign, Grafikdesign oder Onlinemarketing">Wissenswertes</a></li><li ><a href="/webagentur" >Webagentur</a></li><li ><a href="/team" >Team</a></li><li ><a href="/support" >Support<span ></span></a></li><li ><a href="/jobs" >Jobs</a></li><li ><a href="/kontakt" >Kontakt</a>

And here is my PHP code:

$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);

CodePudding user response:

If you can trust that every anchor tag in your input starts with <a href, then try:


'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';

// Instead of what you have:

'#<a.*href="[^"]*".*>Kontakt<\/a>#';

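.* combines the "wildcard" metacharacter . with the "zero or more times" quantifier *.

.* matches any characters (except newlines, by default) any number of times, and it takes as many as it can. That lets it run past the end of one tag and into the next ones. [^>]* cannot cross a >, so it stays inside the opening tag.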

Try it https://regex101.com/r/qxnRZv/1
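To see the difference concretely, here is a minimal sketch that runs both patterns against a shortened copy of the sample markup. The variable names ($greedy, $bounded, and so on) and the three-item string are illustrative assumptions, not part of the original code.

```php
<?php
// Shortened copy of the markup from the question (assumption: three items are enough to show the effect).
$string = '<li ><a href="/team" >Team</a></li>'
        . '<li ><a href="/jobs" >Jobs</a></li>'
        . '<li ><a href="/kontakt" >Kontakt</a></li>';

// Original pattern: .* is free to run across tag boundaries.
$greedy = '#<a.*href="[^"]*".*>Kontakt<\/a>#';

// Suggested pattern: [^>]* cannot leave the opening tag.
$bounded = '#<a href="[^"]*"[^>]*>Kontakt<\/a>#';

preg_match_all($greedy, $string, $greedyMatches);
preg_match_all($bounded, $string, $boundedMatches);

// The greedy match starts at the first <a and ends at Kontakt</a>.
echo $greedyMatches[0][0], PHP_EOL;

// The bounded match contains only the Kontakt link:
// <a href="/kontakt" >Kontakt</a>
echo $boundedMatches[0][0], PHP_EOL;
```

The bounded pattern still assumes that href is the first attribute; if other attributes may come before it, <a[^>]*href= is a possible variation.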

CodePudding user response:

You are using preg_match_all(), so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's just not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware": it only matches text that looks like HTML elements.

In the XPath query, search for <a> elements (at any depth in the document) that have the qualifying string as their whole text.

Code:

$html = <<<HTML
<li ><a href="/webdesign-tipps" title="Wissenswertes zu Webdesign, Grafikdesign oder Onlinemarketing">Wissenswertes</a></li><li ><a href="/webagentur" >Webagentur</a></li><li ><a href="/team" >Team</a></li><li ><a href="/support" >Support<span ></span></a></li><li ><a href="/jobs" >Jobs</a></li><li ><a href="/kontakt" >Kontakt</a></li></ul>
HTML;

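$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
// Select every <a> element that has a text node equal to "Kontakt".
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
    $result[] = $dom->saveHTML($a); // serialize only the matched node
}
var_export($result);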

Output:

array (
  0 => '<a href="/kontakt">Kontakt</a>',
)

Is it more concise to use regex? Yes, but it is also less reliable for general use.

You will notice that DOMDocument also cleans up the unnecessary spacing inside the opening tag when it serializes the element.
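If the link text might carry surrounding whitespace, or if only the URL is needed, the same approach can be adapted. The following is a sketch under those assumptions: the sample markup with padded text and the $links variable are made up for illustration. It uses the XPath function normalize-space(), which compares the element's trimmed string value (including text in nested elements) rather than a single text node.

```php
<?php
// Hypothetical markup: the Kontakt link text is padded with spaces.
$html = '<ul><li><a href="/jobs">Jobs</a></li>'
      . '<li><a href="/kontakt"> Kontakt </a></li></ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];
// normalize-space() trims and collapses whitespace in the element's string value.
foreach ($xpath->query('//a[normalize-space() = "Kontakt"]') as $a) {
    $links[] = $a->getAttribute('href'); // collect the URL instead of the markup
}
var_export($links);
```

With this input, $links should contain the single entry '/kontakt'.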

CodePudding user response:

Your regex:

...a.*href...

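is greedy, which means: "after a, match as many characters as possible before a href". With several anchors on one line, that lets a single match run from the first <a to the last href, and the later .* stretches the same way up to >Kontakt</a>.

You can make a quantifier lazy by appending ? :

...a.*?href...

which means "after a, match as few characters as possible before a href". On its own this may not be enough here: the engine still starts at the first <a and simply extends the match until >Kontakt</a> fits, so the parts between the tags also need to be restricted, for example with [^>]* in place of .*.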
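A quick way to check how a lazy quantifier behaves on this input is to run it next to a variant that uses negated character classes. The sketch below assumes a shortened two-item copy of the sample markup; the names $lazy and $bounded are illustrative. A lazy quantifier limits how far each wildcard extends, but it does not change where the match starts.

```php
<?php
// Shortened copy of the markup from the question (assumption: two items are enough).
$string = '<li ><a href="/jobs" >Jobs</a></li>'
        . '<li ><a href="/kontakt" >Kontakt</a></li>';

// Both wildcards made lazy with ?.
$lazy = '#<a.*?href="[^"]*".*?>Kontakt<\/a>#';

// Negated character classes keep each part inside one opening tag.
$bounded = '#<a[^>]*href="[^"]*"[^>]*>Kontakt<\/a>#';

preg_match($lazy, $string, $lazyMatch);
preg_match($bounded, $string, $boundedMatch);

// The lazy match still begins at the Jobs anchor, because the engine
// commits to the leftmost <a and extends until >Kontakt</a> is found.
echo $lazyMatch[0], PHP_EOL;

// The bounded match is only the Kontakt anchor:
// <a href="/kontakt" >Kontakt</a>
echo $boundedMatch[0], PHP_EOL;
```

If the markup can vary more than this, a DOM parser is likely the safer choice.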