Why does an en dash in a title tag break Unicode strings in DOMDocument? This code:
<?php
$html = <<<'HTML'
<!DOCTYPE html>
<html><head>
<title>example.org – example.org - example.org</title>
<meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;
$domd = new DOMDocument("1.0", "UTF-8");
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
$interesting = $domd->getElementsByTagName("body")->item(0)->textContent;
var_dump($interesting, bin2hex($interesting));
prints this garbled output:
string(14) "TrÃ¤dgÃ¥rd"
string(28) "5472c383c2a46467c383c2a57264"
However, if we just remove the en dash from line 5, changing it to
<title>example.org example.org - example.org</title>
it prints
string(10) "Trädgård"
string(20) "5472c3a46467c3a57264"
So why does an en dash break Unicode strings in DOMDocument?
(It took me a long time to track down that the en dash is the cause x.x )
CodePudding user response:
I don't know exactly why, but the key seems to be that any non-ASCII characters occurring before the UTF-8 declaration confuse the parser. That is:
<!DOCTYPE html>
<html><head>
<title>æøå</title>
<meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
will confuse it, while
<!DOCTYPE html>
<html><head>
<meta charset="utf-8" />
<title>æøå</title>
</head>
<body>Trädgård</body>
</html>
works fine. @Tino Didriksen also found this quote about the encoding declaration at https://www.w3.org/International/questions/qa-html-encoding-declarations :
so it's best to put it immediately after the opening head tag.
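The byte dump in the question looks like classic double encoding: the UTF-8 bytes of the body were read as if they were ISO-8859-1 and then converted to UTF-8 a second time. The sketch below only reproduces the two hex strings from the question. It assumes the mbstring extension is available, and it does not prove what the parser does internally.

```php
<?php
// Assumption: the mbstring extension is loaded.
// "Trädgård" written as raw UTF-8 bytes, so the encoding of this file does not matter.
$utf8 = "Tr\xC3\xA4dg\xC3\xA5rd";

// Treat every byte as one ISO-8859-1 character and encode the result as UTF-8.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump(bin2hex($utf8));          // 5472c3a46467c3a57264
var_dump(bin2hex($doubleEncoded)); // 5472c383c2a46467c383c2a57264
```

The second value matches the broken output in the question byte for byte, which fits the idea that the encoding was guessed before the meta tag was reached.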
Also, as the top-rated comment in the loadHTML documentation mentions, a quick-and-dirty workaround is:
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
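If you use this workaround in more than one place, it can be wrapped in a small helper. This is only a sketch: loadHtmlAsUtf8 is a made-up name, not part of the DOM extension, and the cleanup loop assumes the prepended declaration may be kept in the document as a processing-instruction node.

```php
<?php
// Hypothetical helper (not a built-in): load HTML that is known to be UTF-8.
function loadHtmlAsUtf8(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Put the declaration first so the encoding is known before any text is parsed.
    // The @ silences parser warnings, as in the question.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Copy the child list first, then remove the helper declaration
    // if the parser kept it as a processing-instruction node.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }
    return $doc;
}

// Same shape as the question's document: en dash in the title, meta charset after it.
$html = "<!DOCTYPE html>\n<html><head>\n"
      . "<title>example.org \xE2\x80\x93 example.org</title>\n"
      . "<meta charset=\"utf-8\" />\n</head>\n"
      . "<body>Tr\xC3\xA4dg\xC3\xA5rd</body></html>";

$doc = loadHtmlAsUtf8($html);
$body = $doc->getElementsByTagName('body')->item(0)->textContent;
var_dump($body, bin2hex($body));
```

Removing the node matters mostly if you later call saveHTML() and do not want the extra declaration in the output.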
CodePudding user response:
Remove any existing charset meta tag, then prepend a proper XML header:
$html = preg_replace('~<meta charset=[^>]*>~is', '', $html);
$html = '<?xml version="1.0" encoding="utf-8"?>'."\n".$html;
You can also inject the HTML4-style header instead, but the XML header works in more contexts. E.g.:
$orig = <<<'HTML'
<!DOCTYPE html>
<html><head>
<title>example.org – example.org - example.org</title>
<meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;
$html = $orig;
$html = preg_replace('~<meta charset=[^>]*>~is', '', $html);
$html = str_ireplace('<head>', '<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">', $html);
$domd = new DOMDocument("1.0", "UTF-8");
$domd->loadHTML($html);
$xp = new DOMXPath($domd);
echo "HTML4: ".$domd->getElementsByTagName("body")->item(0)->textContent."\n";
$html = $orig;
$html = preg_replace('~<meta charset=[^>]*>~is', '', $html);
$html = '<?xml version="1.0" encoding="utf-8"?>'."\n".$html;
$domd = new DOMDocument("1.0", "UTF-8");
$domd->loadHTML($html);
$xp = new DOMXPath($domd);
echo "XML: ".$domd->getElementsByTagName("body")->item(0)->textContent."\n";
Output:
HTML4: Trädgård
XML: Trädgård