Aligning UTF-8 strings with the `printf` and `sprintf` functions in Perl

I have a Perl script that aligns UTF-8 encoded strings and writes them to a file. Part of the script is shown below:

#!/usr/bin/perl
use strict;
use utf8;
use locale;
use warnings;
...
my $length_sv = 9;
open my $out, '>>:encoding(UTF-8)', "filename" or warn "Could not open file - $!" and exit(1);
my ($tid, $cid, $v3, $l, $v5, $sub) = $_ =~ /^\{"id":(\d+),"customer_id":(\d+)(.*?)_login":"(\w{1,10})"(.*?)"subject":"(.*?)"/;
my $subc = substr($sub, 0, $length_sv);
say $subc;
my $string = sprintf "| %-5s | %-1s | %-9s | %-${length_sv}s | %-11s | %-10s|","$time","$num","$tid","$subc","$cid","$l";
say $string;
say $out $string;
close $out;

After running the script, STDOUT shows the following output:

Тест Mark
| 11:00 | 1 | 1234567   | Тест Mark | 10101012      | login   |

But the same line is written to the file with garbled characters:

$ cat filename
| 11:00 | 1 | 1234567   | Ð¢ÐµÑÑ Mark | 10101012      | login   |

I want the column to contain Тест Mark, but Ð¢ÐµÑÑ Mark is written to the file instead.

I tried adding a line like this to the script:

binmode($out,':utf8');

Unfortunately, it didn't help. How can I fix this?

CodePudding user response：

Here is a demonstration of what I think your mistake is:

#!/usr/bin/env perl

use strict;
use warnings;
use utf8;
use v5.10;

open my $fh0, '>:encoding(UTF-8)', './russian_text' or die $!;

print $fh0 'Тест';

close( $fh0 );

say __LINE__, ': ', `cat ./russian_text`;

foreach my $mode ( '<', '<:encoding(UTF-8)' ) {
    open my $fh1, $mode, './russian_text' or die $!;
    my $line = <$fh1>;
    chomp $line;

    open my $fh2, '>:encoding(UTF-8)', './russian_out'  or die $!;
    print $fh2 "Mode: $mode, line: ", $line;
    close( $fh2 );
    say `cat ./russian_out`;
}

From perlopentut:

But never use the bare "<" without having set up a default encoding first. Otherwise, Perl cannot know which of the many, many, many possible flavors of text file you have, and Perl will have no idea how to correctly map the data in your file into actual characters it can work with. Other common encoding formats including "ASCII", "ISO-8859-1", "ISO-8859-15", "Windows-1252", "MacRoman", and even "UTF-16LE". See perlunitut for more about encodings.

perlunitut may also be useful.

CodePudding user response：

Decode your inputs and encode your outputs.

Bug #1: $sub and thus $subc contains text encoded using UTF-8, but printing to a file handle with an encoding layer expects decoded text. The consequence is that you end up with "double-encoded" text in the file. You need to decode your input.
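A minimal sketch of the double-encoding effect, assuming the subject arrives as the raw UTF-8 bytes of "Тест Mark" (the sample bytes are a hypothetical stand-in for the real input). Encoding bytes that were never decoded turns each high byte into two bytes, which is the Ð¢ÐµÑÑ pattern seen in the file; decoding first avoids that.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Raw UTF-8 bytes for "Тест Mark", as an undecoded input would deliver them
# (hypothetical sample data).
my $bytes = "\xD0\xA2\xD0\xB5\xD1\x81\xD1\x82 Mark";

# Encoding undecoded bytes treats every byte as a separate character.
# This mirrors what an :encoding(UTF-8) output layer does to such data.
my $double = encode('UTF-8', $bytes);

# Decoding first yields 9 characters; encoding them restores the input bytes.
my $text  = decode('UTF-8', $bytes);
my $fixed = encode('UTF-8', $text);

# Prints: bytes: 13, double-encoded: 21, characters: 9
printf "bytes: %d, double-encoded: %d, characters: %d\n",
    length($bytes), length($double), length($text);
print "round trip matches input\n" if $fixed eq $bytes;
```

The character count also matters for substr and sprintf, which operate on characters only when the string has been decoded.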

Bug #2: Fixing the first bug will reveal another. You added an encoding layer to your file handle, but not to STDOUT. To fix this, add an encoding layer to STDOUT too.
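To illustrate the output side, here is a small sketch assuming the subject is already decoded text. An in-memory handle stands in for STDOUT or the output file so the written bytes can be inspected; the 12-character column width is an arbitrary choice for the example.

```perl
use strict;
use warnings;

# Decoded text "Тест Mark", written with code point escapes so the example
# does not depend on the encoding of the source file.
my $subject = "\x{422}\x{435}\x{441}\x{442} Mark";

# sprintf pads by characters, so 9 characters get exactly 3 spaces of padding.
my $row = sprintf '| %-12s |', $subject;

# Print through an :encoding(UTF-8) layer. The same layer can be put on
# STDOUT with binmode(STDOUT, ':encoding(UTF-8)').
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "Can't open buffer: $!\n";
print {$out} $row;
close($out);

# Prints: row: 16 characters, written: 20 bytes
printf "row: %d characters, written: %d bytes\n", length($row), length($buffer);
```

The four extra bytes come from the four Cyrillic letters, each of which takes two bytes in UTF-8.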

Fixed version:

# Adds an encoding layer to STDIN, STDOUT and STDERR.
# Sets the default encoding layers for handles opened in scope (including via ARGV).
use open ':std', ':encoding(UTF-8)';

use JSON qw( from_json );

open(my $fh, '>:encoding(UTF-8)', $qfn)
   or die("Can't open \"$qfn\": $!\n");

while ( my $json = <> ) {
   my $data = from_json($json);

   my $tid = $data->{id};
   my $cid = $data->{customer_id};
   my ($l) = map $data->{$_}, grep /_login\z/, keys(%$data);
   my $sub = $data->{subject};

   my $subc = substr($sub, 0, $length_sv);
   say $subc;

   my $string = sprintf "| %-5s | %-1s | %-9s | %-${length_sv}s | %-11s | %-10s|",
      $time, $num, $tid, $subc, $cid, $l;
   say $string;
   say $fh $string;
}

I also replaced your hand-rolled JSON parser with a proper one.
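As a rough sketch of what the parser buys you: decode_json accepts UTF-8 encoded bytes and returns decoded Perl strings, so parsing and decoding happen in one step. The sample record, including the agent_login key, is hypothetical, and the core module JSON::PP is used here in place of JSON so the example has no extra dependencies.

```perl
use strict;
use warnings;
use JSON::PP qw( decode_json );

# Hypothetical record as raw UTF-8 bytes, the way an undecoded handle or
# backticks would deliver it.
my $json_bytes = qq({"id":1234567,"customer_id":10101012,)
               . qq("agent_login":"login",)
               . qq("subject":"\xD0\xA2\xD0\xB5\xD1\x81\xD1\x82 Mark and more"});

# decode_json expects UTF-8 bytes and returns character strings.
my $data = decode_json($json_bytes);

# Take the value of the first key ending in "_login".
my ($login) = map { $data->{$_} } grep { /_login\z/ } keys %$data;

# substr counts characters here because the subject is decoded text.
my $subc = substr($data->{subject}, 0, 9);

# Prints: id=1234567 customer=10101012 login=login subject_chars=9
printf "id=%s customer=%s login=%s subject_chars=%d\n",
    $data->{id}, $data->{customer_id}, $login, length($subc);
```

If the input has already been decoded by an input layer, from_json is the matching call, because it expects decoded text rather than bytes.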

CodePudding user response：

An example of my script:

#!/usr/bin/perl
use strict;
use utf8;
#use locale;
use warnings;

my $curl = qq( curl -s "https://example.com/all" -H "authorization: Bearer $token" \\
    --data-raw '{"scope":[{"by_subject":{"login":['"$id"']}}],"page":0,"page_size":10}' \\
    --compressed 2>/dev/null );
my $response = `$curl`;

$response =~ s/\},\{/\}###\{/g;
my @array = split(/###/, $response);

my $length_sv = 9;
#open my $out, '>>:encoding(UTF-8)', "filename" or warn "Could not open file - $!" and exit(1);
open my $out, '>>', "filename" or warn "Could not open file - $!" and exit(1);

foreach(@array) {
    my ($tid, $cid, $v3, $l, $v5, $sub) = $_ =~ /^\{"id":(\d+),"customer_id":(\d+)(.*?)_login":"(\w{1,10})"(.*?)"subject":"(.*?)"/;
    my $subc = substr($sub, 0, $length_sv);
    say $subc;
    my $string = sprintf "| %-5s | %-1s | %-9s | %-${length_sv}s | %-11s | %-10s|","$time","$num","$tid","$subc","$cid","$l";
    say $string;
    say $out $string;
}

close $out;

I removed the line use locale and replaced open my $out, '>>:encoding(UTF-8)' with open my $out, '>>'. After that, the file is written with the correct encoding.

You should set the encoding for the input file where "Тест" comes from.

Bug #1: $sub and thus $subc contains text encoded using UTF-8, but printing to a file handle with an encoding layer expects decoded text. The consequence is that you end up with "double-encoded" text in the file. You need to decode your input.

Could you tell me how to correctly specify the encoding of incoming data from an external program?

Bug #2: Fixing the first bug will reveal another. You added an encoding layer to your file handle, but not to STDOUT. To fix this, add an encoding layer to STDOUT too.

I believe the correct way to do this is to add:

binmode(STDOUT,':utf8');

The question of how to specify the encoding of the input data remains.

I am getting input this way:

my $curl = qq( curl -s "https://example.com/all" -H "authorization: Bearer $token" \\
    --data-raw '{"scope":[{"by_subject":{"login":['"$id"']}}],"page":0,"page_size":10}' \\
    --compressed 2>/dev/null );
my $response = `$curl`;
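One possible approach, sketched under the assumption that the external program emits UTF-8: backticks return raw bytes, so the result can be decoded explicitly, or the command can be read through a pipe with a decoding layer. The child perl command below is only a stand-in for the curl call so the example runs on its own.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Stand-in for the curl call: a child perl ($^X) that prints the UTF-8
# bytes of "Тест Mark". Purely illustrative; substitute the real command.
my @cmd = ($^X, '-e', 'print qq(\xD0\xA2\xD0\xB5\xD1\x81\xD1\x82 Mark)');

# Option 1: read through a pipe and let an input layer do the decoding.
open(my $pipe, '-|', @cmd) or die "Can't run command: $!\n";
binmode($pipe, ':encoding(UTF-8)');
my $from_pipe = do { local $/; <$pipe> };
close($pipe);

# Option 2: keep backticks, which return raw bytes, and decode explicitly.
my $bytes      = `"$^X" -e "print qq(\\xD0\\xA2\\xD0\\xB5\\xD1\\x81\\xD1\\x82 Mark)"`;
my $from_ticks = decode('UTF-8', $bytes);

# Prints: pipe: 9 characters, backticks: 9 characters
printf "pipe: %d characters, backticks: %d characters\n",
    length($from_pipe), length($from_ticks);
```

Either way, the decoded string can then be passed to substr, sprintf, and a handle with an :encoding(UTF-8) layer without being double-encoded.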

Thank you for your responses! You helped a lot!