Perl - Encoding error when working with .html file

Time:11-03

I have some .html files in a directory, and I want to add one line of CSS to each of them. Using Perl, I can locate the position with a regex and add the CSS; this works very well.

However, my first .html file contains an accented letter, é, and the resulting .html file has an encoding problem: it prints \xE9 instead.

In the Perl file, I have been careful to specify UTF-8 encoding when opening the input and output files, as shown in the MWE below, but that does not solve the problem. How can I solve this encoding error?

MWE

use strict;
use warnings;
use File::Spec::Functions qw/ splitdir rel2abs /; # To get the current directory name

# Define variables
my ($inputfile, $outputfile, $dir);

# Initialize variables
$dir = '.';

# Open current directory
opendir(DIR, $dir) or die "Cannot open directory $dir: $!";

# Scan all files in directory
while (my $inputfile = readdir(DIR)) {
    
    #Name output file based on input file
    $outputfile = $inputfile;
    $outputfile =~ s/_not_centered//;
    
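    # Process only regular files whose names end in _not_centered.html
    next unless (-f "$dir/$inputfile");
    next unless ($inputfile =~ m/_not_centered\.html$/);

    # Open output file (only after the checks, so unrelated files are not truncated)
    open(my $ofh, '>:encoding(UTF-8)', $outputfile)
        or die "Cannot open $outputfile: $!";

    # Open input file
    open(my $ifh, '<:encoding(UTF-8)', $inputfile)
        or die "Cannot open $inputfile: $!";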
    
    # Read input file
    while(<$ifh>) {
        # Add the centering style to the <h2 tag; other lines pass through unchanged
        s/<h2/<h2 style="text-align: center;"/;
        print $ofh $_;
    }
    
    # Close input and output files
    close $ifh;
    close $ofh;
}

# Close directory
closedir(DIR);

Problematic file named "Chapter_001_not_centered.html"

<html > 
<head></head>
<body>
                                                           
<h2 ><span >Chapter&#x00A0;1</span><br /><a id="x1-10001"></a>Brocéliande</h2>
Brocéliande

</body></html>

CodePudding user response:

The following demo script performs the required injection, using the glob function to select the files.

Note: the script creates a new file; uncomment the rename call to replace the original file with the new one.

use strict;
use warnings;

# Default I/O layer for open() calls in this file: read and write as Latin-1 (ISO 8859-1)
use open ":encoding(Latin1)";

my $dir = '.';

process($_) for glob("$dir/*_not_centered.html");

sub process {
    my $fname_in  = shift;
    my $fname_new = $fname_in . '.new';
    
    open my $in, '<', $fname_in
        or die "Couldn't open $fname_in: $!";
        
    open my $out, '>', $fname_new
        or die "Couldn't open $fname_new: $!";
        
    while( <$in> ) {
        s/<h2/<h2 style="text-align: center;"/;
        print $out $_;
    }
    
    close $in;
    close $out;

    # rename $fname_new, $fname_in
    #    or die "Couldn't rename $fname_new to $fname_in";

}
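
The use open ":encoding(Latin1)" line is what deals with the accented letter. Here is a minimal sketch of why, assuming the file stores é as the single byte 0xE9, i.e. ISO 8859-1 rather than UTF-8. The \xE9 in the question suggests this, but the original post does not confirm it. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK FB_PERLQQ LEAVE_SRC);

# Assumption: the file stores "é" as the single byte 0xE9 (ISO 8859-1).
my $bytes = "Broc\xE9liande";

# Strict UTF-8 decoding rejects a lone 0xE9 byte, so the eval fails.
my $is_utf8 = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;

# With the PERLQQ fallback, the undecodable byte becomes the literal
# four characters \xE9, the same escaped form the question shows.
my $symptom = decode('UTF-8', $bytes, FB_PERLQQ);

# Latin-1 decoding maps byte 0xE9 to the character U+00E9 (é).
my $text = decode('ISO-8859-1', $bytes);

printf "valid UTF-8: %s\n", $is_utf8 ? 'yes' : 'no';               # valid UTF-8: no
printf "decoded as UTF-8 with fallback: %s\n", $symptom;           # Broc\xE9liande
printf "Latin-1 code point: U+%04X\n", ord(substr($text, 4, 1));   # U+00E9
```

A byte sequence that fails the strict UTF-8 check but reads correctly as Latin-1 is a strong hint that the file was never UTF-8 in the first place.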

If you do not mind running the script on a per-file basis, the following can be invoked as script.pl in_file > out_file

use strict;
use warnings;

s/<h2/<h2 style="text-align: center;"/, print for <>;

If such a task arises only occasionally, it can be solved with a one-liner

perl -pe "s/<h2/<h2 style='text-align: center;'/" in_file

CodePudding user response:

This question found an answer in the comments of @Shawn and @sticky bit:

Changing the encoding used when opening the input and output files to ISO 8859-1 solves the problem. If one of you wants to post the answer, I will validate it.
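
A minimal sketch of that change, assuming the .html files really are ISO 8859-1. The center_h2 helper and the temporary demo file are hypothetical and exist only to make the example self-contained; they are not part of the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: add the centering style to <h2 tags, reading and
# writing the file as ISO 8859-1 so that byte 0xE9 round-trips as "é".
sub center_h2 {
    my ($in_name, $out_name) = @_;

    open(my $ifh, '<:encoding(ISO-8859-1)', $in_name)
        or die "Cannot open $in_name: $!";
    open(my $ofh, '>:encoding(ISO-8859-1)', $out_name)
        or die "Cannot open $out_name: $!";

    while (my $line = <$ifh>) {
        $line =~ s/<h2/<h2 style="text-align: center;"/;
        print {$ofh} $line;
    }

    close $ifh;
    close $ofh or die "Cannot close $out_name: $!";
    return;
}

# Demo on a throwaway file that contains the raw Latin-1 byte for "é".
my $dir = tempdir(CLEANUP => 1);
my $in  = "$dir/Chapter_001_not_centered.html";
my $out = "$dir/Chapter_001.html";

open(my $fh, '>:raw', $in) or die "Cannot create $in: $!";
print {$fh} "<h2 >Broc\xE9liande</h2>\n";
close $fh or die "Cannot close $in: $!";

center_h2($in, $out);

# Read the result back as raw bytes to confirm 0xE9 survived unchanged.
open(my $check, '<:raw', $out) or die "Cannot open $out: $!";
my $result = do { local $/; <$check> };
close $check;

print $result =~ /Broc\xE9liande/ ? "accent preserved\n" : "accent lost\n";
```

If the resulting pages should be UTF-8 instead, only the output layer would change to '>:encoding(UTF-8)', and the charset declared by the page would have to match.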
