In DOMDocument, why does an "en dash" in a title tag break Unicode strings?

Time:04-30

Why does an "en dash" in a title tag break Unicode strings in DOMDocument? This code

<?php
$html = <<<'HTML'
<!DOCTYPE html>
<html><head>
    <title>example.org – example.org - example.org</title>
    <meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;
$domd = new DOMDocument("1.0", "UTF-8");
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
$interesting = $domd->getElementsByTagName("body")->item(0)->textContent;
var_dump($interesting, bin2hex($interesting));

prints the nonsense

string(14) "TrÃ¤dgÃ¥rd"
string(28) "5472c383c2a46467c383c2a57264"

However, if we just remove the en dash from line 5, changing it to

    <title>example.org example.org - example.org</title>

it prints

string(10) "Trädgård"
string(20) "5472c3a46467c3a57264"

So why does an en dash break Unicode strings in DOMDocument?

(took me a long time to track down that the en-dash is the cause x.x )

CodePudding user response:

I don't know exactly why, but the key seems to be that any non-ASCII character occurring before the UTF-8 declaration confuses the parser. That is:

<!DOCTYPE html>
<html><head>
    <title>æøå</title>
    <meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>

will confuse it, while

<!DOCTYPE html>
<html><head>
    <meta charset="utf-8" />
    <title>æøå</title>
</head>
<body>Trädgård</body>
</html>

works fine. @Tino Didriksen found this quote at https://www.w3.org/International/questions/qa-html-encoding-declarations:

"... so it's best to put it immediately after the opening head tag."
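
For what it's worth, the garbled bytes in the question are consistent with the body having been read as ISO-8859-1 and then re-encoded as UTF-8. That is an inference from the hex dump, not something confirmed from the parser's source. A minimal sketch that reproduces both hex strings, assuming the mbstring extension is available:

```php
<?php
// UTF-8 bytes of "Trädgård", written as escapes so this file's own encoding doesn't matter.
$utf8 = "Tr\xC3\xA4dg\xC3\xA5rd";

// Pretend those bytes are ISO-8859-1 and convert them to UTF-8 (needs ext-mbstring).
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // 5472c3a46467c3a57264
echo bin2hex($garbled), "\n"; // 5472c383c2a46467c383c2a57264
```

The second line matches the 14-byte string from the question, and the first matches the 10-byte string printed once the en dash is removed.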

As the top-rated comment in the loadHTML documentation mentions, a quick-and-dirty workaround is

$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
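
If you need this in more than one place, the workaround can be wrapped in a small function. This is only a sketch: loadHtmlAsUtf8 is a made-up name, not part of DOMDocument, and it assumes the input really is UTF-8. It uses libxml_use_internal_errors() instead of @ to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not a DOMDocument method): parse an HTML string as UTF-8
// regardless of where, or whether, the document declares its charset.
function loadHtmlAsUtf8(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The prepended pseudo-declaration tells the parser the encoding up front.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$html = <<<'HTML'
<!DOCTYPE html>
<html><head>
    <title>example.org – example.org - example.org</title>
    <meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;

$doc = loadHtmlAsUtf8($html);
$body = $doc->getElementsByTagName("body")->item(0)->textContent;
var_dump(trim($body), bin2hex(trim($body)));
```

The title still comes before the meta tag here, as in the question, so this is the case that otherwise goes wrong.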

CodePudding user response:

Remove any existing charset meta, then prepend a proper XML header:

$html = preg_replace('~<meta charset=[^>]*>~is', '', $html);
$html = '<?xml version="1.0" encoding="utf-8"?>'."\n".$html;

You can also inject the HTML4-style header instead, but the XML header works in more contexts. E.g.:

$orig = <<<'HTML'
<!DOCTYPE html>
<html><head>
    <title>example.org – example.org - example.org</title>
    <meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;

$html = $orig;
$html = preg_replace('~<meta charset=[^>]*>~is', '', $html);
$html = str_ireplace('<head>', '<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">', $html);

$domd = new DOMDocument("1.0", "UTF-8");
$domd->loadHTML($html);
$xp = new DOMXPath($domd);
echo "HTML4: ".$domd->getElementsByTagName("body")->item(0)->textContent."\n";

$html = $orig;
$html = preg_replace('~<meta charset=[^>]*>~is', '', $html);
$html = '<?xml version="1.0" encoding="utf-8"?>'."\n".$html;

$domd = new DOMDocument("1.0", "UTF-8");
$domd->loadHTML($html);
$xp = new DOMXPath($domd);
echo "XML: ".$domd->getElementsByTagName("body")->item(0)->textContent."\n";

Output:

HTML4: Trädgård
XML: Trädgård