Mojo::DOM breaking UTF8 in Perl

I'm trying to find out how to use Mojo::DOM with UTF-8 (and other encodings, not just UTF-8). It seems to mess up the encoding:

    my $dom = Mojo::DOM->new($html);

    $dom->find('script')->reverse->each(sub {
        #print "$_->{id}\n";
        $_->remove;
    });

    $dom->find('style')->reverse->each(sub {
        #print "$_->{id}\n";
        $_->remove;
    });

    $html = "$dom"; # pass back to $html, now we have cleaned it up...

This is what I get when saving the file without running it through Mojo:

...and then once through Mojo:

FWIW, I'm grabbing the HTML file using Path::Tiny, with:

    my $utf8 = path($_[0])->slurp_raw;

Which, to my understanding, should already give me the string as bytes ready for Mojo?

UPDATE: After Brian's suggestion, I looked into how I could figure out the encoding so I can decode the file correctly. I tried Encode::Guess and a few others, but they seemed to get it wrong on quite a few files. This one seems to do the trick:

    # encguess prints the file name and the guessed encoding, separated by whitespace
    my $enc_tmp = `encguess $_[0]`;
    my ($fname,$type) = split /\s+/, $enc_tmp;
    my $decoded = decode( $type||"UTF-8", path($_[0])->slurp_raw );

CodePudding user response：

You are slurping raw octets but not decoding them (storing the raw octets in $utf8). Then you treat them as if you had decoded them, so the result is mojibake.
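To make the difference between octets and characters concrete, here is a minimal sketch using only the core Encode module. The sample string and the variable names are assumptions chosen for illustration; they are not from the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "©" (U+00A9) is stored in a UTF-8 file as the two octets 0xC2 0xA9.
my $octets = "Copyright \xC2\xA9 2022";

# Undecoded, Perl counts every octet as its own character.
my $raw_length = length $octets;

# Decoded, the two octets collapse into the single character U+00A9.
my $chars       = decode('UTF-8', $octets);
my $char_length = length $chars;

# Encoding the undecoded octets treats 0xC2 and 0xA9 as two separate
# characters and encodes each of them again. That is the mojibake.
my $double        = encode('UTF-8', $octets);
my $double_length = length $double;

printf "raw: %d, decoded: %d, double-encoded: %d\n",
    $raw_length, $char_length, $double_length;
```

The raw string is one unit longer than the decoded one, and the double-encoded string is longer again, which is what shows up on screen as "Â©".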

If you read raw octets, decode them before you use them. You'll end up with the right Perl internal string.
Path::Tiny's slurp_utf8 will decode for you.
Likewise, you have to encode when you output again. The open pragma does that in this example.
Mojolicious already has Mojo::File->slurp to get raw octets, so you can reduce your dependency list.

    use v5.10;
    use utf8;

    # :std applies the UTF-8 layer to STDIN, STDOUT, and STDERR
    use open qw(:std :utf8);
    use Path::Tiny;
    use Mojo::File;
    use Mojo::Util qw(decode);

    # Write a small UTF-8 test file that contains a non-ASCII character
    my $filename = 'test.txt';
    open my $fh, '>:encoding(UTF-8)', $filename or die "Cannot open $filename: $!";
    say { $fh } "Copyright © 2022";
    close $fh;

    say "===== Path::Tiny::slurp_raw, no decode";
    say path($filename)->slurp_raw;

    say "===== Path::Tiny::slurp_raw, decode";
    say decode( 'UTF-8', path($filename)->slurp_raw );

    say "===== Path::Tiny::slurp_utf8";
    say path($filename)->slurp_utf8;

    say "===== Mojo::File::slurp, decode";
    say decode( 'UTF-8', Mojo::File->new($filename)->slurp );

The output:

    ===== Path::Tiny::slurp_raw, no decode
    Copyright Â© 2022

    ===== Path::Tiny::slurp_raw, decode
    Copyright © 2022

    ===== Path::Tiny::slurp_utf8
    Copyright © 2022

    ===== Mojo::File::slurp, decode
    Copyright © 2022
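Applied to the Mojo::DOM code from the question, the same decode-then-encode pattern might look like the sketch below. The inline HTML string stands in for the octets you would get from slurp_raw, and the variable names are assumptions for illustration. It assumes the input really is UTF-8; for other encodings, pass the detected name to decode instead.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(decode encode);

# Stand-in for path($file)->slurp_raw: UTF-8 octets, not characters.
my $octets = qq{<html><head><style>p { color: red }</style></head>}
           . qq{<body><p>Copyright \xC2\xA9 2022</p>}
           . qq{<script>alert(1)</script></body></html>};

# 1. Decode the octets into Perl characters before parsing.
#    Mojo::Util::decode returns undef if the input cannot be decoded.
my $html = decode('UTF-8', $octets);
die "input is not valid UTF-8\n" unless defined $html;

# 2. Work on characters: remove every script and style element.
my $dom = Mojo::DOM->new($html);
$dom->find('script, style')->each(sub { $_->remove });

# 3. Encode back into octets only when the result leaves the program.
my $cleaned = encode('UTF-8', $dom->to_string);
```

$cleaned can then be written with a raw method such as Path::Tiny's spew_raw. If you write through a filehandle that already has a UTF-8 layer, skip step 3, otherwise the text is encoded twice.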