Perl: Converting strings to Unicode

Time:09-17

I have a MySQL database that stores strings with the Unicode characters encoded using an XML-style format (i.e., &#nnnnn; ). An example of one of these strings would be: &#27010;&#36848; which represents the Unicode characters: 概述

Perl lets me make this conversion in my application if I hard-code the strings in the format:

\x{6982}\x{8ff0}
or even:
\N{U+6982}\N{U+8ff0}

To me it seems like a simple matter of changing the format from &#nnnnn; to \x{nnnn}.
The Perl escape requires hexadecimal numbers, whereas the values stored in MySQL are decimal integers.

I wanted to do this simple conversion with a regex, so I matched the integer using:

m/\&\#(\d{3,5});/;

Then I converted the match to hex using: sprintf('{%x}',$1)
Then I added the necessary \x prefix to get: \x{nnnn}

I was easily able to create strings that contained "\x{6982}\x{8ff0}", but none of them were printed by the application as Unicode. They were printed exactly as they were created: symbols and text.

I found out that if you hard-coded these strings into the program, Perl would "interpolate" them into Unicode characters. But if they were created as a string, the "interpolation" did not take place.
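The difference described above can be seen in a minimal sketch (the variable names are only illustrative). An escape written in a double-quoted literal is turned into a character when Perl compiles the program, while the same text assembled at run time is just a sequence of ordinary characters:

```perl
use strict;
use warnings;

# Escape inside a double-quoted literal: the parser converts it
# into a single character (U+6982) at compile time.
my $literal = "\x{6982}";

# The same text built at run time: a backslash, an 'x', braces and
# hex digits. Nothing re-parses it, so it stays eight characters.
my $built = '\x' . sprintf('{%x}', 27010);

printf "%d\n", length $literal;   # 1
printf "%d\n", length $built;     # 8
```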

I tried to force the interpolation by using various functions such as:
Encode::decode('UTF-8', "some string" );
Encode::encode('UTF-8', "some string" );
But that wasn't what those functions were intended for.
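For context, here is a small sketch of what those two functions do, assuming the core Encode module. They convert between bytes and characters in a given encoding; they do not interpret backslash escapes, so the text \x{6982} passes through unchanged:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $bytes = "\xE6\xA6\x82";             # the three UTF-8 bytes of U+6982
my $chars = decode('UTF-8', $bytes);    # bytes -> one character
my $again = encode('UTF-8', $chars);    # character -> the same three bytes

# Plain ASCII escape text is not touched by decode().
my $escape_text = '\x{6982}';
my $unchanged   = decode('UTF-8', $escape_text);

printf "%d %d %d\n", length $chars, length $again, length $unchanged;   # 1 3 8
```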
I also tried to use Perl's manual string interpolation

$v="${ \($v) }";

But that did not convert the string "\x{6982}\x{8ff0}" into Unicode. It simply remained the same string as before.

I came across an example using "eval()".


while($unicodeString =~ m/\&\#(\d{3,5});/) {
    $_=$unicodeString;     ## in the XML form of (spaces added so you could see it here): & #27010; & #36848;
    m/\&\#(\d{3,5});/;     ## Matches the integer number in the Unicode
    my $y=q(\x).sprintf('{%x}',$1); ## Converts the integer to hex and adds the \x{}
    my $v = eval qq{"$y"}; ## Performs the interpolation of the string to get the Unicode
    $unicodeString =~ s/\&\#(\d{3,5});/$v/;  ## Replaces the old code with the new Unicode character
}

This conversion works now. But I am not happy with the repeated use of eval() to convert each character: one-at-a-time. I could build my string in the While loop and then simply eval() the new string. But I would prefer to only eval() those small strings that were specifically matched in Regex.

Is there a better way of converting an XML string (with Unicode characters shown as integers) into a string that contains the actual Unicode characters?

How can I easily go from a string that contains:

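&#25105;&#35748;&#35782;&#21040;&#33258;&#24049;&#30340;&#38271;&#22788;&#21644;&#30701;&#22788;,&#24182;&#36861;&#27714;&#33258;&#25105;&#21457;&#23637;&#12290;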

to one with:
我认识到自己的长处和短处,并追求自我发展。

The documents I need to convert contain thousands of these characters.

Answer:

Here is a simple example of how you can replace the numeric character references using the chr function:

use feature qw(say);
use strict;
use warnings;
use open qw( :encoding(utf-8) :std );

my $str = "&#27010;&#36848;";
$str =~ s/&#(\d+);/chr $1/eg;
printf "%vX\n", $str;
say $str;

Output:

6982.8FF0
概述

Answer:

I didn't find a module that decodes XML entities, because they are normally found only in XML, where the XML parser handles them. But it's pretty easy to recreate.

use feature qw( say state );

sub decode_xml_entities_inplace {
   state $ents = {
      amp  => "&",
      lt   => "<",
      gt   => ">",
      quot => '"',
      apos => "'",
   };
   
   $_[0] =~ s{
      &
      (?: \# (?: x([0-9a-fA-F]+)
             |   ([0-9]+)
             )
      |   (\w+)
      )
      ;
   }{
      if    (defined($1)) { chr(hex($1))      }
      elsif (defined($2)) { chr($2)           }
      else                { $ents->{$3} // $& }
   }xeg;
}

my $s = "&#27010;&#36848;";
decode_xml_entities_inplace($s);
say $s;

Of course, if you simply need to handle the decimal numeric entities, the above simplifies to

use feature qw( say );

my $s = "&#27010;&#36848;";
$s =~ s{ &\# ([0-9]+) ; }{ chr($1) }xeg;
say $s;
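Since the documents in question contain thousands of these references, the decimal-only substitution could be wrapped in a small function and applied to each string in one pass. This is only a sketch: the name decode_decimal_refs is hypothetical and does not come from the answers above, and the /r modifier assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the text with every decimal
# reference (&#NNNNN;) replaced by the character it names.
sub decode_decimal_refs {
    my ($text) = @_;
    # /e evaluates chr($1) for each match, /g handles every reference
    # in a single pass, /r returns the modified copy.
    return $text =~ s{&\#([0-9]+);}{chr($1)}egr;
}

my $decoded = decode_decimal_refs('&#27010;&#36848;');
printf "%vX\n", $decoded;   # 6982.8FF0
```

No eval is involved, because chr builds each character directly from its number.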