In Perl, how to create a "mixed-encoding" string (or a raw sequence of bytes) in a scalar?-CodePudding

In a Perl script of mine, I have to write a mix of UTf-8 and raw bytes into files.

I have a big string in which everything is encoded as UTF-8. In that "source" string, UTF-8 characters are just like they should be (that is, UTF-8-valid byte sequences), while the "raw bytes" have been stored as if they were codepoints of the value held by the raw byte. So, in the source string, a "raw" byte of 0x50 would be stored as one 0x50 byte; whereas a "raw" byte of 0xff would be stored as a 0xc3 0xbf two-byte utf-8-valid sequence. When I write these "raw" bytes back, I need to put them back to single-byte form.

I have other data structures allowing me to know which parts of the string represent what kind of data. A list of fields, types, lengths, etc.

When writing in a plain file, I write each field in turn, either directly (if it's UTF-8) or by encoding its value to ISO-8859-1 if it's meant to be raw bytes. It works perfectly.

Now, in some cases, I need to write the value not directly to a file, but as a record of a BerkeleyDB (Btree, but that's mostly irrelevant) database. To do that, I need to write ALL the values that compose my record, in a single write operation. Which means that I need to have a scalar that holds a mix of UTF-8 and raw bytes.

Example:

Input Scalar (all hex values): 61 C3 8B 00 C3 BF

Expected Output Format: 2 UTF-8 characters, then 2 raw bytes.

Expected Output: 61 C3 8B 00 FF

At first, I created a string by concatenating the same values I was writing to my file from an empty string. And I tried writing that very string to a "standard" file without adding encoding. I got '?' characters instead of all my raw bytes over 0x7f (because, obviously, Perl decided to consider my string to be UTF-8).

Then, to try and tell Perl that it was already encoded, and to "please not try to be smart about it", I tried to encode the UTF-8 parts into "UTF-8", encode the binary parts into "ISO-8859-1", and concatenate everything. Then I wrote it. This time, the bytes looked perfect, but the parts which were already UTF-8 had been "double-encoded", that is, each byte of a multi-byte character had been seen as its codepoint...

I thought Perl wasn't supposed to re-encode "internal" UTF-8 into "encoded" UTF-8, if it was internally marked as UTF-8. The string holding all the values in UTF-8 comes from a C API, which sets the UTF-8 marker (or is supposed to, at the very least), to let Perl know it is already decoded.

Any idea about what I did miss there?

Is there a way to tell Perl what I want to do is just put a bunch of bytes one after another, and to please not try to interpret them in any way? The file I write to is opened as ">:raw" for that very reason, but I guess I need a way to specify that a given scalar is "raw" too?

Epilogue: I found the cause of the problem. The $bigInputString was supposed to be entirely composed of UTF-8 encoded data. But it did contain raw bytes with big values, because of a bug in C (turns out a "char" (not "unsigned char") is best tested with bitwise operators, instead of a " > 127"... ahem). So, "big" bytes weren't split into a two-bytes UTF-8 sequence, in the C API.

Which means the $bigInputString, created from the bad C data, didn't have the expected contents, and Perl rightfully didn't like it either.

After I corrected the bug, the string correctly encoded to UTF-8 (for the parts I wanted to keep as UTF-8) or LATIN-1 (for the "raw bytes" I wanted to convert back), and I got no further problems.

Sorry for wasting your time, guys. But I still learned some things, so I'll keep this here. Moral of the story, Devel::Peek is GOOD for debugging (thanks ikegami), and one should always double check, instead of assuming. Granted, I was in a hurry on friday, but the fault is still mine.

So, thanks to everyone who helped, or tried to, and special thanks to ikegami (again), who used quite a bit of his time helping me.

CodePudding user response：

Assuming you have a Unicode string where you know what each codepoint is supposed to be stored as - a UTF-8 sequence or a single byte, and a way to create a template string where each character represents what the corresponding one of the unicode string is supposed to use (U for UTF-8, C for single byte to keep things simple), you can use pack:

#!/usr/bin/env perl
use strict;
use warnings;

sub process {
    my ($str, $formats) = @_;
    my $template = "C0$formats";
    my @chars = map { ord } split(//, $str);
    pack $template, @chars;
}

my $str = "\x61\xC3\x8B\x00\xC3\xBF";
utf8::decode($str);
print process($str, "UUCC"); # Outputs 0x61 0xc3 0x8b 0x00 0xff

CodePudding user response：

So you have

my $in = "\x61\xC3\x8B\x00\xC3\xBF";

and you want

my $out = "\x61\xC3\x8B\x00\xFF";

This is the result of decoding only some parts of the input string, so you want something like the following:

sub decode_utf8 { my ($s) = @_; utf8::decode($s) or die("Invalid Input"); $s }

my $out = join "",
               substr($in, 0, 3),
   decode_utf8(substr($in, 3, 1)),
   decode_utf8(substr($in, 4, 2));

Tested.

Alternatively, you could decode the entire thing and re-encode the parts that should be encoded.

sub encode_utf8 { my ($s) = @_; utf8::encode($s); $s }

utf8::decode($in) or die("Invalid Input");
my $out = join "",
   encode_utf8(substr($in, 0, 2)),
               substr($in, 2, 1),
               substr($in, 3, 1);

Tested.

You have not indicate how you know which to decode and which not to decode, but you indicated you have this information.