I have the following code:
$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';
$dom = new DOMDocument();
$dom->loadHTML($content);
$xp = new DOMXpath($dom);
$nodes = $xp->query("iframe[src*='.example.com/hello/']");
foreach($nodes as $node){
echo $node->nodeName ." : ". $node->nodeValue, PHP_EOL;
}
Could anyone tell me why the XPath query doesn't match the iframe? What am I doing wrong?
CodePudding user response:
Your code, as it stands, raises a warning:
Warning: DOMXPath::query(): Invalid expression in ... on line ...
It is a good idea to have those warnings displayed on your server; for how to do that, see https://stackoverflow.com/a/21429652/2123530.
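As a minimal sketch of what that looks like while debugging (the two settings below are meant for a development environment, and the echoed message is only illustrative), you can turn error display on in the script itself and check the return value of query(), which is false when the expression is malformed:

```php
<?php
// Show every warning and notice while debugging (development setting only).
error_reporting(E_ALL);
ini_set('display_errors', '1');

$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';
$dom = new DOMDocument();
$dom->loadHTML($content);
$xp = new DOMXPath($dom);

// The CSS-style selector is not valid XPath, so query() emits a warning
// and returns false instead of a DOMNodeList.
$nodes = $xp->query("iframe[src*='.example.com/hello/']");

if ($nodes === false) {
    echo 'The XPath expression could not be evaluated', PHP_EOL;
}
```

Checking for false before looping also avoids a second warning from foreach, which expects an array or an object.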
So your XPath query is invalid, and the cause is the way you test whether the src attribute contains a string.
The construct you are using there, [src*='.example.com/hello/'], is a CSS attribute selector, not an XPath one.
The equivalent in XPath would be:
iframe[contains(@src, '.example.com/hello/')]
But you're not done yet: when you feed a fragment of HTML like that to DOMDocument, it tries to turn it into a valid HTML document. So doing something like:
<?php
$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';
$dom = new DOMDocument();
$dom->loadHTML($content);
$dom->formatOutput = true;
echo $dom->saveXML();
will show you that your HTML code, the one from $content, became:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>whatever <iframe style="display:none;" src="https://www.example.com/hello/id"/></p>
</body>
</html>
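To see what that wrapping does to the query, here is a small sketch (the variable names $relative and $anywhere are only illustrative) comparing a relative expression with one that searches the whole tree. A relative path is evaluated from the document node, whose only element child is html, so it cannot reach the iframe nested further down:

```php
<?php
$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';
$dom = new DOMDocument();
$dom->loadHTML($content);
$xp = new DOMXPath($dom);

// Relative path: looks for an iframe that is a direct child of the document node.
$relative = $xp->query("iframe[contains(@src, '.example.com/hello/')]");

// Descendant path: looks for an iframe anywhere in the document.
$anywhere = $xp->query("//iframe[contains(@src, '.example.com/hello/')]");

echo $dom->documentElement->nodeName, PHP_EOL; // html
echo $relative->length, PHP_EOL;               // 0
echo $anywhere->length, PHP_EOL;               // 1
```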
From there, you have three solutions:
- Either you look for any matching iframe in the whole HTML document:
//iframe[contains(@src,'.example.com/hello/')]
- Or you point at it at its specific level, html > body > p > iframe:
/html/body/p/iframe[contains(@src,'.example.com/hello/')]
- Or you point at it at its specific level, with wildcards for the parent nodes:
/*/*/*/iframe[contains(@src,'.example.com/hello/')]
All together:
<?php
$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';
$dom = new DOMDocument();
$dom->loadHTML($content);
$xp = new DOMXpath($dom);
echo $xp->query("//iframe[contains(@src,'.example.com/hello/')]")
->item(0)
->nodeName,
PHP_EOL,
$xp->query("/html/body/p/iframe[contains(@src,'.example.com/hello/')]")
->item(0)
->nodeName,
PHP_EOL,
$xp->query("/*/*/*/iframe[contains(@src,'.example.com/hello/')]")
->item(0)
->nodeName;
Gives:
iframe
iframe
iframe
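Applied to the loop from the question, a corrected version could look like the sketch below. Printing the src attribute with getAttribute() instead of nodeValue is my own choice here, since an empty iframe has no text content to show:

```php
<?php
$content = 'whatever <iframe style="display:none;" src="https://www.example.com/hello/id"></iframe>';
$dom = new DOMDocument();
$dom->loadHTML($content);
$xp = new DOMXPath($dom);

// Search the whole document, so the <html>, <body> and <p> wrappers do not matter.
$nodes = $xp->query("//iframe[contains(@src, '.example.com/hello/')]");

foreach ($nodes as $node) {
    // Each match is a DOMElement, so getAttribute() is available.
    echo $node->nodeName, ' : ', $node->getAttribute('src'), PHP_EOL;
}
// prints "iframe : https://www.example.com/hello/id"
```

Unlike item(0), which returns null when nothing matches, the foreach simply does nothing on an empty result.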