Xpath query couldn't match

Time:10-02

I have the following code:

$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';

$dom = new DOMDocument();
$dom->loadHTML($content);
$xp = new DOMXpath($dom);

$nodes = $xp->query("iframe[src*='.example.com/hello/']");

foreach($nodes as $node){
    echo $node->nodeName ." :  ". $node->nodeValue, PHP_EOL;
}

Could anyone tell me why Xpath query couldn't match the iframe? What am I doing wrong?

CodePudding user response:

Your code, as it stands, raises a warning:

Warning: DOMXPath::query(): Invalid expression in ... on line ...

It would be a good idea to have those warnings displayed on your server; for how to do that, see https://stackoverflow.com/a/21429652/2123530.
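As a rough sketch of what that can look like during development: the snippet below turns error display on with the standard error_reporting() and ini_set() calls, then checks the return value of query(), which is false for a malformed expression. The shortened HTML string and the $invalid flag are only illustrative, and enabling display_errors is assumed to be acceptable only on a development machine.

```php
<?php
// Development-only assumption: show every warning and notice in the output.
error_reporting(E_ALL);
ini_set('display_errors', '1');

$dom = new DOMDocument();
$dom->loadHTML('<p>whatever <iframe src="https://www.example.com/hello/id"></iframe></p>');
$xp = new DOMXPath($dom);

// The CSS-style selector is not valid XPath: query() emits a warning
// and returns false instead of a DOMNodeList.
$nodes = $xp->query("iframe[src*='.example.com/hello/']");

// $invalid is a hypothetical flag used here only to make the check explicit.
$invalid = ($nodes === false);
if ($invalid) {
    echo "The XPath expression was rejected", PHP_EOL;
}
```

Checking for false before looping also avoids a second, more confusing warning from foreach when the query could not be parsed.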


So your XPath query is invalid, and the cause is the way you are trying to test whether the src attribute contains a string.

The construct you are using there is a CSS attribute selector, not an XPath one.
The equivalent in XPath would be

iframe[contains(@src, '.example.com/hello/')]
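For comparison, here is a small sketch of how a few common CSS attribute selectors could be written with XPath 1.0 functions. It is not an exhaustive mapping. The expressions use a leading // so that they match anywhere in the parsed document (the reason is covered further down), and the variable names are only illustrative.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p><iframe src="https://www.example.com/hello/id"></iframe></p>');
$xp = new DOMXPath($dom);

// CSS [src*='x'] (contains)    -> contains(@src, 'x')
$contains = $xp->query("//iframe[contains(@src, '.example.com/hello/')]");

// CSS [src^='x'] (starts with) -> starts-with(@src, 'x')
$startsWith = $xp->query("//iframe[starts-with(@src, 'https://www.example.com/')]");

// CSS [src='x'] (exact match)  -> @src = 'x'
$exact = $xp->query("//iframe[@src = 'https://www.example.com/hello/id']");

// Each query should find the single iframe in this sample document.
echo $contains->length, ' ', $startsWith->length, ' ', $exact->length, PHP_EOL;
```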

But with that, you're not done yet: when you feed a fragment of HTML like this to DOMDocument, it tries to turn it into a valid HTML document. Doing something like:

<?php
$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';

$dom = new DOMDocument();
$dom->loadHTML($content);
$dom->formatOutput = true;
echo $dom->saveXML();

will make you realise that the HTML from $content became

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <p>whatever <iframe style="display:none;" src="https://www.example.com/hello/id"/></p>
  </body>
</html>

From there on, you have three solutions:

  • Either look for any matching iframe in the whole HTML document
    //iframe[contains(@src,'.example.com/hello/')]
    
  • Or point to it at its specific level, html > body > p > iframe
    /html/body/p/iframe[contains(@src,'.example.com/hello/')]
    
  • Or point to it at its specific level, with wildcards for the parent nodes
    /*/*/*/iframe[contains(@src,'.example.com/hello/')]
    

All together:

<?php
$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';

$dom = new DOMDocument();
$dom->loadHTML($content);

$xp = new DOMXpath($dom);

echo $xp->query("//iframe[contains(@src,'.example.com/hello/')]")
        ->item(0)
        ->nodeName,
     PHP_EOL,
     $xp->query("/html/body/p/iframe[contains(@src,'.example.com/hello/')]")
        ->item(0)
        ->nodeName,
     PHP_EOL,
     $xp->query("/*/*/*/iframe[contains(@src,'.example.com/hello/')]")
        ->item(0)
        ->nodeName;

Gives:

iframe
iframe
iframe
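The snippet above calls item(0) directly, which assumes that every query matches at least one node. A slightly more defensive sketch, closer to the foreach loop in the question, iterates over whatever the query returns and reads the src attribute with getAttribute(). The $sources array is a hypothetical helper used only to collect the results.

```php
<?php
$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';

$dom = new DOMDocument();
$dom->loadHTML($content);
$xp = new DOMXPath($dom);

// Search the whole document, since loadHTML() wraps the fragment in html/body/p.
$nodes = $xp->query("//iframe[contains(@src,'.example.com/hello/')]");

$sources = [];
foreach ($nodes as $node) {
    // getAttribute() returns the attribute value, or an empty string if it is absent.
    $sources[] = $node->getAttribute('src');
    echo $node->nodeName, ' : ', $node->getAttribute('src'), PHP_EOL;
}
```

With no matching iframe, the loop body simply never runs, instead of failing on a null item.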