How does XPath in PHP 8 deal with &nbsp;?

I am trying to scrape a Wikipedia page with plain PHP and have been using $xpath->query to search the DOM. I am trying to select the node that has the text "Known for" on this Wikipedia page: https://en.wikipedia.org/wiki/Ajmal_Kasab. The text is in the right-hand table, just before the text "2008 Mumbai attacks". I loaded the page with DOMDocument::loadHTML and did the following:

var_dump( $value->saveHTML($xpath->query( "//table[@class[contains(.,'infobox')]]//tr[th='Known for']/th/text()" )[0])  );

I tried Known\x20for, Known&nbsp;for, Known&#160;for, etc., but none of them worked. Fortunately I stumbled upon the post "Using XPATH to search text containing &nbsp;" and tried manually typing Alt+0160 on my Windows 10 PC in the Sublime Text 3 editor. The expression then looks like Known<0xa0>for, and it worked.

My first question: why won't XPath accept a normal space or the literal &#160;? The Wikipedia page source has it as Known&#160;for. What if I had Linux or a different text editor? Currently I am working locally; would it work on my Linux-based server as well? What is the computer science behind this?

Secondly, I need to convert an XPath result set that contains spaces into a PHP variable that keeps the <0xa0>. I have:

$tmp = $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='Known<0xa0>for']/th/text()");
$result = $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='{$tmp}']/td/text()");

It seems the variable $tmp doesn't hold on to the <0xa0>, and in turn $result is incorrect (false).

The whole PHP code is more complex and there are many words to search for, so I have boiled the code down to a simpler task. Labels like Known for are dynamic and are fed into a function.

CodePudding user response：

XPath is designed to be hosted in other programming languages (PHP in your case), and rather than having an escaping convention of its own, it relies on the escaping conventions of the host language. So you enter an NBSP (U+00A0) character in the XPath expression the same way as you would enter it in any other PHP string literal, for example "\u{A0}" in a double-quoted string. (A bare "\xA0" is the single byte 0xA0, which is not valid UTF-8; since PHP's DOM extension works in UTF-8, the byte-level spelling is "\xC2\xA0".)

&#xA0; would be appropriate when XPath is hosted in XML, or &nbsp; when it is hosted in HTML, but not when it is hosted in PHP.

You ask "what is the computer science behind this?". Basically, it's to avoid the double-escaping problem. When a sublanguage such as regex has an escape convention (e.g. \\ to represent \) and is then hosted in another language with a similar escape convention, you end up having to write \ as \\\\ (or & as &amp;amp;). Since XPath was designed explicitly for hosting within other languages, its designers decided to use the host language's escaping capabilities rather than superimpose their own.
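As a rough illustration of "use the host language's escapes", here is a minimal sketch of how an NBSP can be spelled in a PHP string before it is handed to XPath. The variable names are only for illustration, and it assumes PHP 7.0 or later for the \u{...} escape.

```php
<?php
// Two equivalent ways to put U+00A0 (NBSP) into a double-quoted PHP string.
$viaCodepoint = "Known\u{A0}for";    // Unicode codepoint escape (PHP 7.0+)
$viaBytes     = "Known\xC2\xA0for";  // the same character as its two UTF-8 bytes

// Single quotes do no escape processing, so this is NOT an NBSP.
$singleQuoted = 'Known\u{A0}for';

var_dump($viaCodepoint === $viaBytes);     // bool(true)
var_dump(strlen("\u{A0}"));                // int(2): NBSP takes two bytes in UTF-8
var_dump($singleQuoted === $viaCodepoint); // bool(false)

// PHP resolves the escape first; the XPath engine only sees the finished string.
$expr = "//tr[th='Known\u{A0}for']/td";
```

Because the escape is resolved by PHP, this behaves the same on Windows and Linux and does not depend on which editor was used to type the source file.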

CodePudding user response：

You claim "The Wikipedia page source has it as Known&#160for", which is not true at all: it has Known&#160;for. Secondly, you call &#160 a literal. Even if you meant &#160;, that is not a literal; it is an HTML numeric character reference, i.e. an escaping mechanism HTML offers so that you don't have to use the literal character. Of course your XPath doesn't work on the HTML source code: you feed your string to the loadHTML method, which uses an HTML parser to parse the HTML source string, so the resulting DOM tree certainly doesn't have any representation of the form &#160; or &nbsp;. It just has a text node with Unicode characters, one of them being the character with decimal code point 160, or hexadecimal U+00A0.
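To make the parser's role concrete, here is a small sketch. It assumes a tiny stand-in fragment rather than the real Wikipedia markup, and it checks what ends up in the text node after loadHTML.

```php
<?php
// Stand-in HTML (an assumption for illustration, not the actual page source).
$html = '<table class="infobox"><tr><th>Known&#160;for</th>'
      . '<td>2008 Mumbai attacks</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$th = $doc->getElementsByTagName('th')->item(0);

// The character reference is gone; the text node holds a real U+00A0.
var_dump($th->textContent === "Known\u{A0}for");         // bool(true)
var_dump(strpos($th->textContent, '&#160;') === false);  // bool(true)
var_dump($th->textContent === 'Known for');              // bool(false): ordinary space differs
```

So an XPath string comparison has to contain the actual U+00A0 character, not the reference and not an ordinary space.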

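Neither XPath nor PHP requires you to escape that character; a literal NBSP typed into a UTF-8 source file works, and <0xa0> appears to be just how your editor displays it. If you do want to escape it in a PHP string literal (https://www.php.net/manual/en/language.types.string.php), it is not <0xa0>; in a double-quoted string it should be \u{A0}, or the UTF-8 bytes \xC2\xA0 (a lone \xA0 is a single byte, not the UTF-8 encoding of U+00A0).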

For the second part of the question, what kind of value do you expect to get from $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='Known<0xa0>for']/th/text()")? A DOM node list? What do you expect to achieve by putting that variable into another PHP string literal in the $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='{$tmp}']/td/text()")?

If you want a PHP string from an XPath evaluation, use an expression which returns a string rather than nodes (string(//th) would return the string value of the first th element), and use the evaluate method, not the query method.
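Putting that together for the dynamic-label case in the question, the following is one possible sketch, not a definitive implementation. The helper name infoboxValue is hypothetical, the HTML is a stand-in fragment, and it assumes labels never contain a single quote (XPath 1.0 string literals have no escape mechanism). It uses translate() so that callers can pass labels with ordinary spaces.

```php
<?php
// Hypothetical helper: returns the <td> text of the infobox row whose <th> matches $label.
// Assumption: $label contains no single quote.
function infoboxValue(DOMXPath $xpath, string $label): string
{
    $nbsp  = "\u{A0}";
    // Normalise the caller's label so NBSP and ordinary space are treated alike.
    $label = str_replace($nbsp, ' ', $label);

    // translate() maps NBSP to a space inside XPath; string() makes the
    // expression return a string instead of a node list.
    $expr = "string(//table[contains(@class,'infobox')]"
          . "//tr[translate(th, '{$nbsp}', ' ') = '{$label}']/td)";

    return trim($xpath->evaluate($expr));
}

// Stand-in markup for demonstration only.
$html = '<table class="infobox vcard"><tr><th>Known&#160;for</th>'
      . '<td>2008 Mumbai attacks</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

echo infoboxValue($xpath, 'Known for'), "\n";   // prints "2008 Mumbai attacks"
```

If no row matches, string() of an empty node-set is the empty string, so the helper returns "" rather than false.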