I'm trying to print the Arabic word for "for sale" (ﻟﻠﺒﻴﻊ), but my terminal reverses it and displays it left to right, even though Arabic is written right to left. The word is roughly equivalent to "llbye", but the terminal writes "eybll" (ﻊﻴﺒﻠﻟ).
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "for sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map ord, split //, $enc;
print "Original string : $str\n"; # ل ل ب ي ع
print "Decoded string 1: $dec\n"; # ل ل ب ي ع
print "Decoded string 2: $decoded\n"; # ل ل ب ي ع
my $k = reverse($decoded);
print "Decode reverse : $k\n";
print "0x$_" for unpack "H*", scalar reverse "$decoded\n";
On line 21 of my script (the unpack line), I'm trying to dump these characters as hex to see them better, but I get:
Character in 'H' format wrapped in unpack at line 21.
Term[Perl]:# perl schreib.pl
Original string : ﻟﻠﺒﻴﻊ
Decoded string 1: ﻟﻠﺒﻴﻊ
Decoded string 2: ﻟﻠﺒﻴﻊ
Decode reverse : ﻊﻴﺒﻠﻟ
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21
In the image, the first blank frame shows what I copy and paste; the terminal inverts it without my permission. I have to use reverse to print it right to left, as in the second frame, which is how it should have appeared when pasted.
How do I transform these characters into hexadecimal?
Answer:
unpack "H*" expects a string of bytes (characters with value 00..FF), but you have a string of Unicode Code Points (characters with value 000000..10FFFF).
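If a byte-level hex dump is what you want, the string has to be turned into bytes before it reaches unpack "H*". A minimal sketch, assuming UTF-8 is the encoding you want to inspect; the names $bytes and $hex are only illustrative, and the string is written with \x{...} escapes so the example does not depend on how the source file is saved:

```perl
use strict;
use warnings;
use Encode qw( encode );

# The same five code points as the question's string, written as escapes.
my $str = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";

# encode() returns a string of bytes (every character is 00..FF),
# which is the kind of input unpack "H*" expects.
my $bytes = encode( 'UTF-8', $str );
my $hex   = unpack 'H*', $bytes;

print "$hex\n";    # efbb9fefbba0efba92efbbb4efbb8a
```

Because every character of $bytes fits in one byte, no "Character in 'H' format wrapped" warning is emitted.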
You can use
sprintf "%vX", $str
which is effectively the same as
join ".", map sprintf( "%X", ord( $_ ) ), split //, $str
and
join ".", map sprintf( "%X", $_ ), unpack "W*", $str
All three work for any string (bytes, UCP, whatever).
For $str, $dec and $decoded, the above produces
FEDF.FEE0.FE92.FEF4.FECA
For $enc, the above produces
EF.BB.9F.EF.BB.A0.EF.BA.92.EF.BB.B4.EF.BB.8A
(You may get something different since our files might not be the same.)
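To check that the three expressions really agree, here is a small self-contained sketch. It assumes the same five code points as in the question (written as escapes, so the file encoding does not matter); @dumps and @forms are illustrative names, not part of the original code:

```perl
use strict;
use warnings;
use Encode qw( encode );

my $str = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";    # decoded text
my $enc = encode( 'UTF-8', $str );                        # UTF-8 bytes

my @dumps;
for my $s ( $str, $enc ) {
    # The three spellings from the answer, applied to the same string.
    my @forms = (
        sprintf( "%vX", $s ),
        join( ".", map sprintf( "%X", ord( $_ ) ), split //, $s ),
        join( ".", map sprintf( "%X", $_ ), unpack "W*", $s ),
    );

    print "$forms[0]\n";
    print "all three agree\n"
        if $forms[0] eq $forms[1] && $forms[1] eq $forms[2];

    push @dumps, $forms[0];
}
```

The first pass prints one number per code point, the second one number per byte, which is an easy way to tell whether a string is decoded or encoded.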
With Unicode Code Points, we can use charnames (and/or Unicode::UCD) for more info.
use charnames qw( :full );
use feature qw( say );
for my $ucp ( unpack "W*", $str ) {
    my $ch = chr( $ucp );
    # (?[ ... ]) is an extended character class (experimental before Perl 5.36).
    if ( $ch =~ /(?[ \p{Print} - \p{Mark} ])/ ) { # Not sure if good enough.
        printf "‹%s› ", $ch;
    } else {
        print "--- ";
    }
    printf "U+%X ", $ucp;
    say charnames::viacode( $ucp );
}
For $str, $dec and $decoded, the above produces
‹ﻟ› U+FEDF ARABIC LETTER LAM INITIAL FORM
‹ﻠ› U+FEE0 ARABIC LETTER LAM MEDIAL FORM
‹ﺒ› U+FE92 ARABIC LETTER BEH MEDIAL FORM
‹ﻴ› U+FEF4 ARABIC LETTER YEH MEDIAL FORM
‹ﻊ› U+FECA ARABIC LETTER AIN FINAL FORM
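For the Unicode::UCD route mentioned above, a minimal sketch using its charinfo function on the first code point of the string. Which fields are worth printing is my own choice here, not something from the original answer:

```perl
use strict;
use warnings;
use Unicode::UCD qw( charinfo );

# charinfo() takes a code point and returns a hash reference
# describing it (name, block, script, decomposition, ...).
my $info = charinfo( 0xFEDF );

printf "U+%s %s\n", $info->{code}, $info->{name};
print "block:         $info->{block}\n";
print "script:        $info->{script}\n";
print "decomposition: $info->{decomposition}\n";
```

The decomposition field is useful here because the string uses presentation forms (U+FExx) and the field points back to the ordinary Arabic letter each one is derived from.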
Data::Dumper with local $Data::Dumper::Useqq = 1; will produce ASCII output as well.
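A minimal sketch of the Data::Dumper approach, again with the string written as escapes. Setting Terse is an extra choice of mine to drop the "$VAR1 = " prefix; it is not needed for the escaping itself:

```perl
use strict;
use warnings;
use Data::Dumper qw( Dumper );

my $str = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";

my $dump = do {
    local $Data::Dumper::Useqq = 1;    # double-quoted output with escapes
    local $Data::Dumper::Terse = 1;    # omit the "$VAR1 = " prefix
    Dumper( $str );
};

# Non-ASCII characters come out as \x{...} escapes, so the result
# is safe to print on any terminal.
print $dump;
```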