I am trying to print all the <p> elements of a particular HTML document fetched from a URL. The HTML document is using UTF-8 encoding.
This is my code:
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);
header('Content-Type: text/plain; charset=utf-8');
header('Access-Control-Allow-Origin: *');
header('Access-Control-Allow-Methods: POST, GET, OPTIONS');

$url = "https://www.sangbadpratidin.in/kolkata/ispat-express-met-an-accident-near-howrah-junction/#.Y7qC6YFeT80.whatsapp";
$user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36";

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $url);
$html = curl_exec($ch);

if (!curl_errno($ch)) {
    $resultStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($resultStatus == 200) {
        @$DOM = new DOMDocument;
        @$DOM->loadHTML($html);
        $bodies = $DOM->getElementsByTagName('p');
        foreach ($bodies as $body) {
            $para = $body->nodeValue;
            echo $para;
        }
    }
}
?>
The HTML document is filled with Bengali characters. When I try to print the values, this is what gets printed:
সà§à¦¬à§à¦°à¦¤ বিশà§à¦¬à¦¾à¦¸: ফà§à¦° দà§à¦°à§à¦à¦à¦¨à¦¾à¦° à¦à¦¬à¦²à§ দà§à...
Why am I not getting the original text? Please help me
CodePudding user response:
Edit: I just tested it, and yes, this fixed it :) See it live at https://dh.ratma.net/test/test2.php
This is a known issue: DOMDocument does not realize the input is UTF-8, defaults to a legacy single-byte encoding such as Windows-1252, and then corrupts the multibyte UTF-8 characters. With a bit of luck, replacing
@$DOM->loadHTML($html);
with
@$DOM->loadHTML('<?xml encoding="UTF-8">' . $html);
should fix it.
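For anyone who wants to try this without fetching the page, here is a minimal self-contained sketch of the same prefix trick. The sample string and the helper name extractParagraphs are made up for illustration and are not from the question; it also uses libxml_use_internal_errors() instead of @ to quiet parser warnings, which is a substitution on my part.

```php
<?php
// Hypothetical helper: returns the text of every <p> in a UTF-8 HTML string.
function extractParagraphs(string $html): array
{
    // Collect libxml warnings internally instead of printing them
    // (real-world markup is rarely perfectly valid).
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The XML-style encoding prefix hints to the parser that the bytes are UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $paragraphs = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $paragraphs[] = $p->textContent;
    }
    return $paragraphs;
}

// Assumed sample input, standing in for the cURL response body.
$sample = '<html><body><p>বাংলা লেখা</p><p>second</p></body></html>';
$result = extractParagraphs($sample);
echo implode("\n", $result), "\n";
```

In the question's code, $html from curl_exec() would take the place of $sample.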
CodePudding user response:
Changing $DOM->loadHTML($html)
to $DOM->loadHTML(mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8"))
seems to resolve the issue.
Source: PHP DOMDocument loadHTML not encoding UTF-8 correctly
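One caveat: the "HTML-ENTITIES" pseudo-encoding for mb_convert_encoding() is deprecated as of PHP 8.2. A possible substitute, sketched below under the assumption that the input is valid UTF-8, is mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity before parsing. The sample string is made up for illustration.

```php
<?php
// Assumed sample input, standing in for the fetched page.
$html = '<html><body><p>বাংলা লেখা</p></body></html>';

// Convert every code point from U+0080 to U+10FFFF into a numeric entity,
// leaving the ASCII markup untouched. Map format: [start, end, offset, mask].
$encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// The input is now pure ASCII, so the parser's default encoding no longer matters.
$dom->loadHTML($encoded);
libxml_clear_errors();

$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Because the string handed to loadHTML() contains only ASCII, the parser has nothing to misinterpret, and the entities are decoded back to the original characters in the DOM.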