Replacing non-ASCII characters non-obtrusively between XML tags

Time:09-17

I have an XML string that I need to tidy up before parsing it with the call below; otherwise parsing fails with the error "Input is not proper UTF-8; indicate encoding":

my $xml_parsed_mess = XML::LibXML->new()->parse_string($xml_mess);

The string is as follows:

my $xml_mess = '<?xml version="1.0" encoding="UTF-8"?><message><tag1>இந்தியாtest123</tag1><tag2>网络test网络</tag2><tag3>i am clean</tag3><tag4>do not worry about me</tag4></message>';

I don't want to convert the entire string to UTF-8; I only want to clean specific tags in the string, for example:

<tag1>இந்தியாtest123</tag1> -> <tag1>test123</tag1>

I know that the substitution for stripping non-ASCII characters is:

$xml_mess =~ s/[[:^ascii:]]+/ /g;

But how do I target only the contents of specific elements such as <tag1>???</tag1> and <tag2>????</tag2>?

I know I can replace the contents of a tag with a fixed string as follows:

$xml_mess =~ s|<tag1>test</tag1>|<tag1>testing</tag1>|;

But how do I run the substitution s/[[:^ascii:]]+/ /g against the contents of a tag, rather than replacing the contents outright, and then update $xml_mess with the result?

User response:

use 5.014;
use warnings;

use XML::LibXML qw( );

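# Parse the file; the parser decodes it using the declared encoding.
my $doc = XML::LibXML->new->parse_file("a.xml");

# Select only the text nodes directly inside <tag1>.
for my $text_node ($doc->findnodes("/message/tag1/text()")) {
   # Replace each run of non-ASCII characters with a single space.
   $text_node->setData(
      $text_node->getData() =~ s/[[:^ascii:]]+/ /rg
   );
}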

print $doc->toString;

Given the following a.xml:

<?xml version="1.0" encoding="UTF-8"?><message><tag1>இந்தியாtest123</tag1><tag2>网络test网络</tag2><tag3>i am clean</tag3><tag4>do not worry about me</tag4></message>

it produces:

<?xml version="1.0" encoding="UTF-8"?><message><tag1> test123</tag1><tag2>网络test网络</tag2><tag3>i am clean</tag3><tag4>do not worry about me</tag4></message>
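
The question also asks about cleaning more than one tag and about working from a string rather than a file. A minimal sketch of one way to do that follows, assuming the input is valid UTF-8 so that the parser accepts it. The helper name clean_tags and the shortened sample input are hypothetical, not part of the original post; the XPath union (|) simply selects the text nodes of every listed element.

```perl
use 5.014;    # needed for the /r (non-destructive) substitution flag
use warnings;

use Encode      qw( encode );
use XML::LibXML qw( );

# Hypothetical helper: replace each run of non-ASCII characters with a single
# space, but only in text nodes directly inside the named children of <message>.
sub clean_tags {
   my ($xml_bytes, @tag_names) = @_;

   my $doc = XML::LibXML->load_xml(string => $xml_bytes);
   return $doc unless @tag_names;

   # Builds e.g. "/message/tag1/text() | /message/tag2/text()".
   my $xpath = join ' | ', map { "/message/$_/text()" } @tag_names;

   for my $text_node ($doc->findnodes($xpath)) {
      $text_node->setData( $text_node->data =~ s/[[:^ascii:]]+/ /gr );
   }

   return $doc;
}

# Sample input written with escapes so this script stays pure ASCII, then
# encoded to UTF-8 bytes, which is the form the parser expects.
my $xml_mess = encode('UTF-8',
     qq{<?xml version="1.0" encoding="UTF-8"?>}
   . qq{<message><tag1>\x{0B87}\x{0BA8}test123</tag1>}
   . qq{<tag2>\x{7F51}\x{7EDC}test\x{7F51}\x{7EDC}</tag2>}
   . qq{<tag3>i am clean</tag3></message>}
);

my $doc = clean_tags($xml_mess, qw( tag1 tag2 ));
print $doc->toString;
```

To delete the characters instead of leaving a space, as in the <tag1>test123</tag1> example from the question, the replacement part of the substitution can be left empty: s/[[:^ascii:]]+//gr.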