I have a MySQL database that stores strings with the Unicode characters encoded as XML-style numeric character references (i.e., &#nnnnn;). An example of one of these strings would be: &#27010;&#36848; which represents the Unicode characters: 概述
Perl lets me make this conversion in my application if I hard-code the strings in the format:
\x{6982}\x{8ff0}
or even: \N{U+6982}\N{U+8ff0}
To me it seems like a simple matter of changing the format from &#nnnnn; to \x{nnnn}.
The Perl escape syntax requires hexadecimal numbers, whereas MySQL is outputting decimal integers.
I wanted to do this simple conversion in Regex. So I matched the integer using:
m/\&\#(\d{3,5});/;
Then I converted the match to hex using:
sprintf('{%x}',$1)
Then I added in the necessary: \x{ }
I was easily able to create strings that contained: "\x{6982}\x{8ff0}"
But none of them were printed by the application as Unicode. They were simply printed as they were created: symbols and text.
I found out that if you hard-coded these strings into the program, Perl would "interpolate" them into Unicode characters. But if they were created as a string, the "interpolation" did not take place.
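The difference described above can be reproduced in a few lines. This is only a minimal sketch with made-up variable names: an escape inside a double-quoted literal is turned into a character when Perl parses the source, while the same text assembled at run time stays as ordinary characters. chr() is shown as one way to get the character without any parsing step.

```perl
use strict;
use warnings;

# Escape written in a double-quoted literal: the parser turns it into ONE character.
my $literal = "\x{6982}";

# The same text assembled at run time is just eight ordinary ASCII characters.
my $built = '\x' . sprintf('{%x}', 27010);

# chr() builds the character from its code point directly.
my $from_chr = chr(27010);

printf "literal:  length %d\n", length $literal;
printf "built:    length %d (%s)\n", length $built, $built;
printf "from chr: %s\n", $from_chr eq $literal ? "same as literal" : "different";
```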
I tried to force the interpolation by using various functions such as:
Encode::decode('UTF-8', "some string" );
Encode::encode('UTF-8', "some string" );
But that wasn't what those functions were intended for.
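For comparison, here is a small sketch of what those two functions do, as far as I understand them: encode() turns a string of characters into UTF-8 bytes and decode() turns the bytes back into characters. Neither one looks for backslash escapes in the text, which is why they leave a run-time string like \x{6982} alone. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $chars = "\x{6982}\x{8ff0}";        # two characters, written as a literal
my $bytes = encode('UTF-8', $chars);   # characters -> UTF-8 bytes
my $back  = decode('UTF-8', $bytes);   # UTF-8 bytes -> characters

printf "%d characters, %d bytes\n", length $chars, length $bytes;
printf "bytes: %s\n", join ' ', map { sprintf '%02x', ord } split //, $bytes;

# A string built at run time holds only ASCII, so encoding changes nothing.
my $built = '\x{6982}';
print "escape text unchanged\n" if encode('UTF-8', $built) eq $built;
```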
I also tried to use Perl's manual string interpolation:
$v="${ \($v) }";
But that did not convert the string "\x{6982}\x{8ff0}" into Unicode. It simply remained the same string as before.
I came across an example using "eval()".
while($unicodeString =~ m/\&\#(\d{3,5});/) {
$_=$unicodeString; ## in the XML form of (spaces added so you could see it here): & #27010; & #36848;
m/\&\#(\d{3,5});/; ## Matches the integer number in the Unicode
my $y=q(\x).sprintf('{%x}',$1); ## Converts the integer to hex and adds the \x{}
my $v = eval qq{"$y"}; ## Performs the interpolation of the string to get the Unicode
$unicodeString =~ s/\&\#(\d{3,5});/$v/; ## Replaces the old code with the new Unicode character
}
This conversion works now. But I am not happy with the repeated use of eval() to convert each character: one-at-a-time. I could build my string in the While loop and then simply eval() the new string. But I would prefer to only eval() those small strings that were specifically matched in Regex.
Is there a better way of converting an XML string (with Unicode characters shown as integers) into a string that contains the actual Unicode characters?
How can I easily go from a string that contains:
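&#25105;&#35748;&#35782;&#21040;&#33258;&#24049;&#30340;&#38271;&#22788;&#21644;&#30701;&#22788;,&#24182;&#36861;&#27714;&#33258;&#25105;&#21457;&#23637;&#12290;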
to one with:
我认识到自己的长处和短处,并追求自我发展。
The documents I need to convert contain thousands of these characters.
CodePudding user response:
Here is a simple example of how you can replace the Unicode escapes using the chr function:
use feature qw(say);
use strict;
use warnings;
use open qw( :encoding(utf-8) :std );
my $str = "&#27010;&#36848;";
$str =~ s/&#(\d+);/chr $1/eg;
printf "%vX\n", $str;
say $str;
Output:
6982.8FF0
概述
CodePudding user response:
I didn't find a module that decodes XML entities, because they are normally only found in XML, where the XML parser handles them. But it's pretty easy to recreate.
use feature qw( say state );
sub decode_xml_entities_inplace {
state $ents = {
amp => "&",
lt => "<",
gt => ">",
quot => '"',
apos => "'",
};
$_[0] =~ s{
&
(?: \# (?: x([0-9a-fA-F]+)
| ([0-9]+)
)
| (\w+)
)
;
}{
if (defined($1)) { chr(hex($1)) }
elsif (defined($2)) { chr($2) }
else { $ents->{$3} // $& }
}xeg;
}
my $s = "&#27010;&#36848;";
decode_xml_entities_inplace($s);
say $s;
Of course, if you simply need to handle the decimal numeric entities, the above simplifies to
use feature qw( say );
my $s = "&#27010;&#36848;";
$s =~ s{ &\# ([0-9]+) ; }{ chr($1) }xeg;
say $s;
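Since the question mentions whole documents with thousands of these references, the same substitution can be applied line by line while copying from one handle to another. The following is only a sketch: convert_stream is a hypothetical helper name, and in-memory handles stand in for real files so the example is self-contained. The output handle has a UTF-8 encoding layer, so the decoded characters are written as UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, decoding decimal
# numeric character references along the way.
sub convert_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        $line =~ s/&#([0-9]+);/chr($1)/eg;
        print {$out} $line;
    }
    return;
}

# In-memory handles stand in for real files here.
my $input = "&#27010;&#36848;\nplain text\n";
open my $in,  '<', \$input or die "Cannot open input: $!";
open my $out, '>:encoding(UTF-8)', \my $result or die "Cannot open output: $!";

convert_stream($in, $out);
close $out;
close $in;

# $result now holds the UTF-8 encoded bytes of the converted text.
printf "%d bytes written\n", length $result;
```

With real files, the two open calls would take file names instead of scalar references; the loop itself stays the same.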