I have an XML string in the format shown below. I need to tidy it up before parsing it as follows; otherwise parsing fails with the error "Input is not proper UTF-8; indicate encoding":
my $xml_parsed_mess = XML::LibXML->new() -> parse_string($xml_mess);
The string is as follows:
my $xml_mess = '<?xml version="1.0" encoding="UTF-8"?><message><tag1>இந்தியாtest123</tag1><tag2>网络test网络</tag2><tag3>i am clean</tag3><tag4>do not worry about me</tag4></message>';
I don't want to convert the entire string to UTF-8; I only want to clean specific tags in the string. For example, in this instance:
<tag1>இந்தியாtest123</tag1> -> <tag1>test123</tag1>
I know that the command to do this is:
$xml_mess =~ s/[[:^ascii:]]+/ /g;
But how do I target the contents of specific elements such as <tag1>???</tag1> and <tag2>????</tag2>?
I know I can change the contents as follows:
$xml_mess =~ s|<tag1>test</tag1>|<tag1>testing</tag1>|;
But how do I apply the substitution $xml_mess =~ s/[[:^ascii:]]+/ /g; to only the contents of a tag, rather than replacing those contents outright, and then update $xml_mess with the result?
CodePudding user response:
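use 5.014;
use warnings;
use XML::LibXML qw( );
# Parse the document so it can be edited as a DOM rather than as a raw string.
my $doc = XML::LibXML->new->parse_file("a.xml");
# Visit only the text nodes directly inside /message/tag1.
for my $text_node ($doc->findnodes("/message/tag1/text()")) {
   # Replace each run of non-ASCII characters with a single space.
   $text_node->setData(
      $text_node->getData() =~ s/[[:^ascii:]]+/ /rg
   );
}
print $doc->toString;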
Given the following a.xml:
<?xml version="1.0" encoding="UTF-8"?><message><tag1>இந்தியாtest123</tag1><tag2>网络test网络</tag2><tag3>i am clean</tag3><tag4>do not worry about me</tag4></message>
it produces:
<?xml version="1.0" encoding="UTF-8"?><message><tag1> test123</tag1><tag2>网络test网络</tag2><tag3>i am clean</tag3><tag4>do not worry about me</tag4></message>
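Since the question starts from a string and mentions both tag1 and tag2, here is a minimal sketch of the same DOM approach applied to a string instead of a file. It rests on a few assumptions that are not in the original: the sample string is the question's data without the stray trailing </xml>, $xml_mess holds decoded characters (hence the `use utf8` and the explicit `encode` to bytes before parsing), and non-ASCII runs are deleted rather than replaced with a space so that tag1 ends up as plain test123. The variable names $clean, $tag1, and $tag2 are only illustrative.

```perl
use 5.014;
use warnings;
use utf8;                      # this source contains literal non-ASCII text
use Encode qw( encode );
use XML::LibXML qw( );

# Sample input adapted from the question (stray trailing </xml> omitted).
my $xml_mess = '<?xml version="1.0" encoding="UTF-8"?>'
   . '<message><tag1>இந்தியாtest123</tag1><tag2>网络test网络</tag2>'
   . '<tag3>i am clean</tag3><tag4>do not worry about me</tag4></message>';

# Assumption: $xml_mess is a decoded character string, so encode it to
# UTF-8 bytes to match the encoding named in the XML declaration.
my $doc = XML::LibXML->new->parse_string( encode('UTF-8', $xml_mess) );

# One XPath union selects the text nodes of both elements to clean.
for my $text_node ($doc->findnodes('/message/tag1/text() | /message/tag2/text()')) {
   # Delete each run of non-ASCII characters (use ' ' as the replacement
   # to get a space instead, as in the code above).
   my $clean = $text_node->getData() =~ s/[[:^ascii:]]+//gr;
   $text_node->setData($clean);
}

my $tag1 = $doc->findvalue('/message/tag1');
my $tag2 = $doc->findvalue('/message/tag2');
say "tag1=$tag1";   # tag1=test123
say "tag2=$tag2";   # tag2=test
```

If the cleaned markup is needed back in $xml_mess, assigning $doc->toString to it should do; tag3 and tag4 are never selected, so their contents are left alone.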