I am trying to scrape a Wikipedia page with plain PHP and have been using xpath->query to search the DOM. I am trying to select the node that has the text Known for on this Wikipedia page: https://en.wikipedia.org/wiki/Ajmal_Kasab. The text is in the right-hand side table, just before the text 2008 Mumbai attacks. I loaded the page with DOMDocument::loadHTML and did the following:
var_dump( $value->saveHTML($xpath->query( "//table[@class[contains(.,'infobox')]]//tr[th='Known for']/th/text()" )[0]) );
I tried Known\x20for, Known&nbsp;for, Known&#160;for, and so on, but none of them worked. Fortunately I stumbled upon the post "Using XPATH to search text containing &nbsp;" and tried manually typing Alt+0160 on my Windows 10 PC in the Sublime Text 3 editor. The expression then looks like Known<0xa0>for, and it worked.
My first question: why in the world won't XPath accept a normal space or the literal &nbsp? The Wikipedia page source has it as Known&nbsp;for. What if I had Linux or a different text editor? I am currently working locally; would it also work on my Linux-based server? What is the computer science behind this?
Secondly, I need to convert the XPath result set, which contains spaces, into a PHP variable that stores <0xa0>. I have:
$tmp = $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='Known<0xa0>for']/th/text()");
$result = $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='{$tmp}']/td/text()");
It seems that the variable $tmp doesn't hold on to the <0xa0>, and in turn $result is incorrect (false).
The whole PHP code is more complex and there are many words to search for, so I have boiled the code down to a simpler task. Words like Known for are dynamic and are fed into a function.
CodePudding user response:
XPath is designed to be hosted in other programming languages (PHP in your case), and rather than having an escaping convention of its own, it relies on the escaping conventions of the host language. So you enter an NBSP (U+00A0) character in the XPath expression the same way you would enter it in any other PHP string literal, for example "\u{A0}" (PHP 7 and later) or the UTF-8 bytes "\xC2\xA0" in a double-quoted string. A bare "\xA0" is a single byte rather than valid UTF-8, which is what PHP's DOM extension works with.
&#xA0; would be appropriate when XPath is hosted in XML, or &nbsp; when it is hosted in HTML, but not when it is hosted in PHP.
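As a minimal sketch of the point above, the following enters the NBSP through PHP's own string escaping and lets PHP interpolate it into the XPath expression. The sample markup is a hypothetical stand-in for the Wikipedia infobox, and the sketch assumes PHP 7 or later for the "\u{A0}" escape.

```php
<?php
// Hypothetical sample markup standing in for the Wikipedia infobox.
$html = '<table class="infobox"><tr><th>Known&#160;for</th>'
      . '<td>2008 Mumbai attacks</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);           // the parser turns &#160; into the character U+00A0
$xpath = new DOMXPath($doc);

// "\u{A0}" is resolved by PHP, not by XPath: the XPath engine only ever
// sees the UTF-8 encoded character inside the string literal.
$label = "Known\u{A0}for";
$nodes = $xpath->query("//table[contains(@class,'infobox')]//tr[th='{$label}']/td");

echo $nodes->item(0)->textContent, PHP_EOL; // 2008 Mumbai attacks
```

Because the escape is handled by PHP itself, the same source file should behave identically on Windows and Linux, regardless of the editor used.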
You ask "what is the computer science behind this?". Basically, it's to avoid the double-escaping problem. When a sublanguage such as regex has an escape convention (e.g. \\ to represent \) and is then hosted in another language with a similar escape convention, you end up having to write \ as \\\\ (or & as &amp;). Since XPath was designed explicitly for hosting within other languages, its designers decided to use the host language's escaping capabilities rather than superimpose their own.
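The double-escaping problem can be illustrated with a short sketch in PHP, where a regex that matches a single backslash has to be escaped once for the regex engine and once more for the PHP string literal. The variable names are only illustrative.

```php
<?php
// The regex engine needs \\ to match one literal backslash, and PHP's
// string literal needs each of those backslashes doubled again.
$pattern = '/\\\\/';              // source: four backslashes between the slashes

var_dump(strlen($pattern));       // int(4): the string PHP holds is /\\/
var_dump(preg_match($pattern, 'a\\b')); // int(1): the subject a\b contains one backslash

// XPath has no escape layer of its own, so only PHP's escaping applies.
$expr = "//th[.='Known\u{A0}for']";
```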
CodePudding user response:
You claim "The Wikipedia page source has it as Known&nbsp;for", which is not true at all; it has Known&#160;for. Secondly, you call &nbsp a literal. Even if you meant &nbsp;, that is not a literal; it is an HTML character reference, i.e. an escaping mechanism HTML offers so that you don't have to use the literal character. Of course, your XPath doesn't work on the HTML source code. You have fed your string to the loadHTML method, which uses an HTML parser to parse the HTML source string, so the resulting DOM tree certainly doesn't have any representation of the form &#160; or &nbsp;. It just has a text node with Unicode characters, one of them being the character with decimal code point 160, or U+00A0 in hexadecimal.
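A small sketch can make this visible: after parsing, the text node holds the character itself, not the reference. The one-row table is a hypothetical stand-in for the real page.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<table><tr><th>Known&#160;for</th></tr></table>');

$text = $doc->getElementsByTagName('th')->item(0)->textContent;

var_dump(strpos($text, '&#160;'));     // bool(false): no character reference survives parsing
echo bin2hex($text), PHP_EOL;          // 4b6e6f776ec2a0666f72 (c2a0 is U+00A0 in UTF-8)
var_dump($text === "Known\u{A0}for");  // bool(true)
var_dump($text === 'Known for');       // bool(false): an ordinary space is a different character
```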
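Neither XPath nor PHP requires you to escape that character. If you do want to escape it in a PHP string literal (https://www.php.net/manual/en/language.types.string.php), the notation is not <0xa0>; in a double-quoted string it would be \u{A0} (PHP 7 and later) or the UTF-8 byte sequence \xC2\xA0.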
For the second part of the question, what kind of value do you expect to get from $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='Known<0xa0>for']/th/text()")? A DOM node list? What do you expect to achieve by putting that variable into another PHP string literal in $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='{$tmp}']/td/text()")?
If you want a PHP string from an XPath evaluation, use an expression that returns a string rather than nodes (string(//th) would return the string value of the first th element), and use the evaluate method, not the query method.
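One way to apply this to dynamic labels is sketched below. The helper infoboxValue is hypothetical (it is not part of the original code), the sample markup stands in for the real page, and the sketch assumes PHP 7 or later and labels that contain no single quotes.

```php
<?php
// Hypothetical helper: looks up an infobox value by its label.
function infoboxValue(DOMXPath $xpath, string $label): string
{
    $nbsp = "\u{A0}"; // U+00A0 as UTF-8 (PHP 7+)
    // translate() maps NBSP to a plain space inside XPath, so $label can use normal spaces.
    // Assumption: $label contains no single quotes, which would break the XPath literal.
    $expr = "string(//table[contains(@class,'infobox')]"
          . "//tr[translate(th, '{$nbsp}', ' ')='{$label}']/td)";
    // evaluate() returns a PHP string here because the expression is wrapped in string().
    return trim($xpath->evaluate($expr));
}

// Hypothetical sample markup standing in for the real page.
$doc = new DOMDocument();
$doc->loadHTML('<table class="infobox"><tr><th>Known&#160;for</th>'
             . '<td>2008 Mumbai attacks</td></tr></table>');
$xpath = new DOMXPath($doc);

echo infoboxValue($xpath, 'Known for'), PHP_EOL; // 2008 Mumbai attacks
```

Normalizing the NBSP inside the expression means the caller never has to type or store the special character, and no DOMNodeList is ever interpolated into a string.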