In a Perl script of mine, I have to write a mix of UTF-8 and raw bytes into files.
I have a big string in which everything is encoded as UTF-8. In that "source" string, UTF-8 characters are just like they should be (that is, UTF-8-valid byte sequences), while the "raw bytes" have been stored as if they were codepoints of the value held by the raw byte. So, in the source string, a "raw" byte of 0x50 would be stored as one 0x50 byte; whereas a "raw" byte of 0xff would be stored as a 0xc3 0xbf two-byte utf-8-valid sequence. When I write these "raw" bytes back, I need to put them back to single-byte form.
I have other data structures allowing me to know which parts of the string represent what kind of data. A list of fields, types, lengths, etc.
When writing in a plain file, I write each field in turn, either directly (if it's UTF-8) or by encoding its value to ISO-8859-1 if it's meant to be raw bytes. It works perfectly.
Now, in some cases, I need to write the value not directly to a file, but as a record of a BerkeleyDB (Btree, but that's mostly irrelevant) database. To do that, I need to write ALL the values that compose my record, in a single write operation. Which means that I need to have a scalar that holds a mix of UTF-8 and raw bytes.
Example:
Input Scalar (all hex values): 61 C3 8B 00 C3 BF
Expected Output Format: 2 UTF-8 characters, then 2 raw bytes.
Expected Output: 61 C3 8B 00 FF
At first, I started from an empty string and concatenated onto it the same values I was writing to my file. Then I tried writing that very string to a "standard" file without adding any encoding. I got '?' characters instead of all my raw bytes over 0x7f (because, obviously, Perl decided to consider my string to be UTF-8).
Then, to try and tell Perl that it was already encoded, and to "please not try to be smart about it", I tried to encode the UTF-8 parts into "UTF-8", encode the binary parts into "ISO-8859-1", and concatenate everything. Then I wrote it. This time, the bytes looked perfect, but the parts which were already UTF-8 had been "double-encoded", that is, each byte of a multi-byte character had been seen as its codepoint...
I thought Perl wasn't supposed to re-encode "internal" UTF-8 into "encoded" UTF-8, if it was internally marked as UTF-8. The string holding all the values in UTF-8 comes from a C API, which sets the UTF-8 marker (or is supposed to, at the very least), to let Perl know it is already decoded.
Any idea what I missed there?
Is there a way to tell Perl what I want to do is just put a bunch of bytes one after another, and to please not try to interpret them in any way? The file I write to is opened as ">:raw" for that very reason, but I guess I need a way to specify that a given scalar is "raw" too?
Epilogue: I found the cause of the problem. The $bigInputString was supposed to be entirely composed of UTF-8 encoded data. But it did contain raw bytes with big values, because of a bug in the C code (turns out a "char" (not "unsigned char") is best tested with bitwise operators, instead of a " > 127"... ahem). So, in the C API, "big" bytes weren't split into a two-byte UTF-8 sequence.
Which means the $bigInputString, created from the bad C data, didn't have the expected contents, and Perl rightfully didn't like it either.
After I corrected the bug, the string correctly encoded to UTF-8 (for the parts I wanted to keep as UTF-8) or LATIN-1 (for the "raw bytes" I wanted to convert back), and I got no further problems.
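For reference, here is a minimal sketch of the flow that works once the input really is a decoded character string. The field boundaries (2 characters of text, then 2 raw bytes) are taken from the example above, the variable names are only illustrative, and the in-memory filehandle stands in for the real ">:raw" file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: the input decodes cleanly, i.e. it is what the C API
# was meant to deliver in the first place.
my $chars = "\x61\xC3\x8B\x00\xC3\xBF";
utf8::decode($chars) or die "input is not valid UTF-8";

# Hypothetical field layout: 2 characters of text, then 2 raw bytes.
my $text = substr($chars, 0, 2);
my $raw  = substr($chars, 2, 2);

# Encode each part on its own, then concatenate the resulting byte strings.
my $check  = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $record = encode('UTF-8', $text, $check)
           . encode('ISO-8859-1', $raw, $check);

# Write the record through a :raw handle (in-memory here, for illustration).
open my $fh, '>:raw', \my $buffer or die "cannot open in-memory handle: $!";
print {$fh} $record;
close $fh;
```

With the example input, $record should hold the five bytes 61 C3 8B 00 FF, and the same scalar could be handed to the database as a single value.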
Sorry for wasting your time, guys. But I still learned some things, so I'll keep this here. Moral of the story: Devel::Peek is GOOD for debugging (thanks ikegami), and one should always double-check, instead of assuming. Granted, I was in a hurry on Friday, but the fault is still mine.
So, thanks to everyone who helped, or tried to, and special thanks to ikegami (again), who used quite a bit of his time helping me.
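For anyone landing here with a similar problem, this is roughly the kind of check that exposes a string that is not what you think it is. It is only a sketch: the hex_dump helper is a made-up name, and Dump writes its report to STDERR.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

# Hypothetical helper: show every character of a string as a hex number.
sub hex_dump {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $s;
}

my $bytes = "\xC3\xBF";     # two bytes, not decoded
my $chars = $bytes;
utf8::decode($chars);       # one character, U+00FF

Dump($bytes);               # FLAGS should not list UTF8
Dump($chars);               # FLAGS should list UTF8

my $bytes_hex = hex_dump($bytes);
my $chars_hex = hex_dump($chars);
print STDERR "bytes: $bytes_hex\nchars: $chars_hex\n";
```

If a string that is supposed to be decoded shows the byte pair instead of the single character, the problem is upstream of the write.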
CodePudding user response:
Assuming you have a Unicode string where you know how each codepoint is supposed to be stored (as a UTF-8 sequence or as a single byte), and a way to create a template string where each character says what the corresponding character of the Unicode string should use (U for UTF-8, C for a single byte, to keep things simple), you can use pack:
#!/usr/bin/env perl
use strict;
use warnings;
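sub process {
    my ($str, $formats) = @_;
    # A leading C0 keeps pack from marking the result as UTF-8, so each U
    # emits the UTF-8 bytes of its codepoint and each C emits one byte.
    my $template = "C0$formats";
    my @chars = map { ord } split(//, $str);
    return pack $template, @chars;
}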
my $str = "\x61\xC3\x8B\x00\xC3\xBF";
utf8::decode($str);
print process($str, "UUCC"); # Outputs 0x61 0xc3 0x8b 0x00 0xff
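The template itself has to come from somewhere. A minimal sketch of building it from a field list follows, assuming each field records its type and its length in characters; the hash keys and the build_formats name are invented for illustration, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical field list: type and length in characters of each field.
my @fields = (
    { type => 'utf8', chars => 2 },
    { type => 'raw',  chars => 2 },
);

# One U per UTF-8 character, one C per raw byte.
sub build_formats {
    my @list = @_;
    my $formats = '';
    for my $field (@list) {
        my $letter = $field->{type} eq 'utf8' ? 'U' : 'C';
        $formats .= $letter x $field->{chars};
    }
    return $formats;
}

my $str = "\x61\xC3\x8B\x00\xC3\xBF";
utf8::decode($str) or die "Invalid input";

my $formats = build_formats(@fields);
my $record  = pack "C0$formats", map { ord } split //, $str;
```

For the example input this should produce the template UUCC and the bytes 61 C3 8B 00 FF.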
CodePudding user response:
So you have
my $in = "\x61\xC3\x8B\x00\xC3\xBF";
and you want
my $out = "\x61\xC3\x8B\x00\xFF";
This is the result of decoding only some parts of the input string, so you want something like the following:
sub decode_utf8 { my ($s) = @_; utf8::decode($s) or die("Invalid Input"); $s }
my $out = join "",
    substr($in, 0, 3),
    decode_utf8(substr($in, 3, 1)),
    decode_utf8(substr($in, 4, 2));
Alternatively, you could decode the entire thing and re-encode the parts that should be encoded.
sub encode_utf8 { my ($s) = @_; utf8::encode($s); $s }
utf8::decode($in) or die("Invalid Input");
my $out = join "",
    encode_utf8(substr($in, 0, 2)),
    substr($in, 2, 1),
    substr($in, 3, 1);
You have not indicated how you know which parts to decode and which not to decode, but you did indicate that you have this information.
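If that information is available as a list of fields with their lengths in characters, the second approach can be turned into a loop. The following is only a sketch under that assumption; the field structure and the mix_fields name are hypothetical.

```perl
use strict;
use warnings;

# Decode everything once, then re-encode only the fields that must stay UTF-8.
sub mix_fields {
    my ($encoded, @fields) = @_;

    my $decoded = $encoded;
    utf8::decode($decoded) or die "Invalid Input";

    my $out = '';
    my $pos = 0;
    for my $field (@fields) {
        my $part = substr($decoded, $pos, $field->{chars});
        $pos += $field->{chars};

        if ($field->{type} eq 'utf8') {
            utf8::encode($part);        # characters -> UTF-8 bytes
        }
        else {
            utf8::downgrade($part);     # dies if a "raw byte" is above 0xFF
        }
        $out .= $part;
    }
    return $out;
}

my $out = mix_fields(
    "\x61\xC3\x8B\x00\xC3\xBF",
    { type => 'utf8', chars => 2 },
    { type => 'raw',  chars => 2 },
);
```

The call to utf8::downgrade doubles as a sanity check: a field that is supposed to hold raw bytes but contains a character above 0xFF makes it die instead of silently producing garbage.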