UTF-8 encoded characters show as gibberish in PHP

Time:02-01

I am trying to print all the <p> elements of a particular HTML document fetched from a URL. The HTML document is using UTF-8 encoding.

This is my code:

<?php
    error_reporting(E_ALL);
    ini_set('display_errors', 1);
    header('Content-Type: text/plain; charset=utf-8');
    header('Access-Control-Allow-Origin: *');
    header('Access-Control-Allow-Methods: POST, GET, OPTIONS');

    $url = "https://www.sangbadpratidin.in/kolkata/ispat-express-met-an-accident-near-howrah-junction/#.Y7qC6YFeT80.whatsapp";

    $user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"; 
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL,$url);
    $html=curl_exec($ch);

    if (!curl_errno($ch)) {
        $resultStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if ($resultStatus == 200) {
            @$DOM = new DOMDocument;
            @$DOM->loadHTML($html);
            
            $bodies = $DOM->getElementsByTagName('p');
            foreach($bodies as $body){
                $para = $body->nodeValue;
                echo $para;
            }
        }
    }
?>

The HTML document is filled with Bengali characters. When I try to print the values, this is what gets printed:

সà§à¦¬à§à¦°à¦¤ বিশà§à¦¬à¦¾à¦¸: ফà§à¦° দà§à¦°à§à¦à¦à¦¨à¦¾à¦° à¦à¦¬à¦²à§ দà§à...

Why am I not getting the original text? Please help me.

CodePudding user response:

Edit: I just tested it, and yes, this fixed it :) See it live at https://dh.ratma.net/test/test2.php

This is a known issue: DOMDocument does not realize the input is UTF-8, falls back to a single-byte Western encoding (ISO-8859-1/Windows-1252), and proceeds to corrupt the actual UTF-8 multibyte characters. With a bit of luck, replacing

@$DOM->loadHTML($html);

with

@$DOM->loadHTML('<?xml encoding="UTF-8">' . $html);

should fix it.
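
For reference, here is a minimal sketch of that fix applied to a fixed sample string instead of a fetched page, so it can be run without network access. The function name extractParagraphs and the sample markup are made up for illustration and are not part of the original code. It assumes the dom extension is enabled, and it uses libxml_use_internal_errors() in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not from the original post): returns the text of every <p>.
function extractParagraphs(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    // The XML declaration prefix hints to libxml that the bytes are UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $paragraphs[] = $p->textContent;
    }
    return $paragraphs;
}

// Made-up sample markup with no charset declaration, like the failing case.
$sample = '<html><body><p>বাংলা লেখা</p><p>plain ASCII</p></body></html>';
foreach (extractParagraphs($sample) as $text) {
    echo $text, PHP_EOL;
}
```

In the original script, the same prefix would go in front of the $html returned by curl_exec().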

CodePudding user response:

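Changing $DOM->loadHTML($html) to $DOM->loadHTML(mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8")) seems to resolve the issue. Note that the "HTML-ENTITIES" encoding in mbstring is deprecated as of PHP 8.2, so this call raises a deprecation notice on newer versions.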

Source: PHP DOMDocument loadHTML not encoding UTF-8 correctly
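
Since mbstring's "HTML-ENTITIES" encoding is deprecated as of PHP 8.2, the same idea can be sketched with mb_encode_numericentity() instead. This is a swapped-in alternative, not the code from the answer above. The helper name loadUtf8Html and the sample string are hypothetical, and the sketch assumes the mbstring and dom extensions are available.

```php
<?php
// Hypothetical helper (not from the original post): loads UTF-8 HTML into a
// DOMDocument by first turning every non-ASCII character into a numeric entity.
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Everything from U+0080 upward becomes &#NNNN;, ASCII is left untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $asciiOnly = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($asciiOnly);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Made-up sample input for illustration.
$dom = loadUtf8Html('<p>বাংলা লেখা</p>');
foreach ($dom->getElementsByTagName('p') as $p) {
    echo $p->textContent, PHP_EOL;
}
```

Because the parser only ever sees ASCII bytes, the result does not depend on which encoding it would otherwise guess.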
