Mojo::DOM breaking UTF8 in Perl

Time:06-30

I'm trying to find out how to use Mojo::DOM with UTF-8 (and other encodings, not just UTF-8). It seems to mess up the encoding:

    my $dom = Mojo::DOM->new($html);

    # Remove every <script> element
    $dom->find('script')->reverse->each(sub {
        #print "$_->{id}\n";
        $_->remove;
    });

    # Remove every <style> element
    $dom->find('style')->reverse->each(sub {
        #print "$_->{id}\n";
        $_->remove;
    });

    $html = "$dom"; # pass back to $html, now that it has been cleaned up

This is what I get when saving the file without running it through Mojo:

[screenshot not available]

...and then once through Mojo:

[screenshot not available]

FWIW, I'm grabbing the HTML file using Path::Tiny, with:

    my $utf8 = path($_[0])->slurp_raw;

To my understanding, that should already give me the string decoded into bytes, ready for Mojo?

UPDATE: After Brian's suggestion, I looked into how to detect the encoding so that I could decode the file correctly. I tried Encode::Guess and a few others, but they got it wrong on quite a few files. This seems to do the trick:

    # encguess prints the file name followed by the guessed encoding
    my $enc_tmp = `encguess $_[0]`;
    my ($fname, $type) = split /\s+/, $enc_tmp;
    my $decoded = decode( $type || "UTF-8", path($_[0])->slurp_raw );
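
If shelling out is not an option, one alternative is to try a strict UTF-8 decode first and fall back to a single legacy encoding when that fails. The following is only a sketch using the core Encode module: the helper name decode_with_fallback and the cp1252 default are assumptions for illustration, not something from the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8 first, then fall back to a
# caller-supplied legacy encoding (cp1252 is only an assumed default).
sub decode_with_fallback {
    my ($octets, $fallback) = @_;
    $fallback //= 'cp1252';

    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
    # $octets intact so it can be decoded again below.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, 'UTF-8') if defined $text;

    return (decode($fallback, $octets), $fallback);
}

# A lone A9 byte is not valid UTF-8, so the fallback is used here.
my ($text, $used) = decode_with_fallback("Copyright \xA9 2022");
print "decoded as $used\n";
```

This only distinguishes "valid UTF-8" from "everything else". Pure ASCII input always passes the UTF-8 check, and almost any byte sequence decodes under a single-byte encoding, so the fallback has to be chosen from what you know about your files.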

Answer:

You are slurping raw octets (and storing them in $utf8) but never decoding them. You then treat them as if they were decoded characters, so the result is mojibake.
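
To see why this produces mojibake, it helps to compare lengths. This is a minimal sketch using only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 octets for "Copyright © 2022"; © (U+00A9) is the two bytes C2 A9.
my $octets = "Copyright \xC2\xA9 2022";

# Decoded: one Perl character per code point.
my $chars = decode('UTF-8', $octets);

# The mistake: treating the undecoded octets as characters and encoding them.
# Each byte is taken as a Latin-1 code point, so C2 A9 becomes C3 82 C2 A9.
my $double = encode('UTF-8', $octets);

printf "octets:         %d\n", length $octets;   # 17
printf "characters:     %d\n", length $chars;    # 16
printf "double-encoded: %d\n", length $double;   # 19
```

The extra C3 82 pair is what shows up as a stray "Â" in front of the "©".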

  • If you read raw octets, decode them before you use them. You then have a proper Perl character string.
  • slurp_utf8 will decode for you.
  • Likewise, you have to encode when you output again. The open pragma does that in this example.
  • Mojolicious already has Mojo::File->slurp to get raw octets, so you can reduce your dependency list.
A short program that shows each case:

    use v5.10;
    use utf8;                  # this source file contains a literal ©

    use open qw(:std :utf8);   # encode output on the standard handles as UTF-8
    use Path::Tiny;
    use Mojo::File;
    use Mojo::Util qw(decode);

    # Write a small UTF-8 test file
    my $filename = 'test.txt';
    open my $fh, '>:encoding(UTF-8)', $filename or die "Cannot open $filename: $!";
    say { $fh } "Copyright © 2022";
    close $fh;

    say "===== Path::Tiny::slurp_raw, no decode";
    say path($filename)->slurp_raw;

    say "===== Path::Tiny::slurp_raw, decode";
    say decode( 'UTF-8', path($filename)->slurp_raw );

    say "===== Path::Tiny::slurp_utf8";
    say path($filename)->slurp_utf8;

    say "===== Mojo::File::slurp, decode";
    say decode( 'UTF-8', Mojo::File->new($filename)->slurp );

The output:

    ===== Path::Tiny::slurp_raw, no decode
    Copyright Â© 2022

    ===== Path::Tiny::slurp_raw, decode
    Copyright © 2022

    ===== Path::Tiny::slurp_utf8
    Copyright © 2022

    ===== Mojo::File::slurp, decode
    Copyright © 2022
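
Applied to the code in the question, the same decode, process, encode order might look like the sketch below. The inline HTML string is a stand-in for path($file)->slurp_raw, and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Mojo::DOM;

# Stand-in for path($file)->slurp_raw: the raw UTF-8 octets of a small page.
# \x{A9} is the © character.
my $octets = encode('UTF-8',
    '<html><head><style>p{color:red}</style></head>'
  . "<body><p>Copyright \x{A9} 2022</p><script>alert(1)</script></body></html>");

# 1. Decode the octets into a Perl character string before parsing.
my $html = decode('UTF-8', $octets);

# 2. Work on characters: drop every <script> and <style> element.
my $dom = Mojo::DOM->new($html);
$dom->find('script, style')->each(sub { $_->remove });

# 3. Encode back to octets before writing the result anywhere.
my $clean_octets = encode('UTF-8', "$dom");

# STDOUT has no encoding layer here, so the octets pass through unchanged.
print $clean_octets, "\n";
```

With Path::Tiny, reading with slurp_utf8 and writing with spew_utf8 should let you skip the explicit decode and encode steps, as long as the file really is UTF-8.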