Extra empty elements when using XML::LibXML::Reader instead of XML::LibXML

Time:03-24

I'm looking to parse a WordPress blog export. I've used some XML::LibXML code successfully on my sample output of 3 blog entries, but I decided to try XML::LibXML::Reader instead, since I expect to have to parse a very large file and am concerned about running out of memory.

However, I'm getting some extra blank nodes.

The problem can be demonstrated using the following code and XML document:

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader;

my $filename = $ARGV[0];

my $reader = XML::LibXML::Reader->new(location => $filename) or die;

my $entry_pattern = 'XML::LibXML::Pattern'->new('/rss/channel/item');

while ($reader->nextPatternMatch($entry_pattern)) {
    say "MATCH";
    my $item = $reader->copyCurrentNode(1);
    say $item;
    say 'Title: ', $item->findvalue('./title');
    say "";
}

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
   <channel>
      <title>Blog</title>
      <item><title>Title 1</title></item>
      <item><title>Title 2</title></item>
   </channel>
</rss>

The output obtained:

MATCH
<item><title>Title 1</title></item>
Title: Title 1

MATCH
<item/>
Title:

MATCH
<item><title>Title 2</title></item>
Title: Title 2

MATCH
<item/>
Title:

Note the extra <item/> matches. Where do these come from? How can I avoid them?

Answer:

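What seems to be happening is that the end tag of each item is being matched as well. A pull/stream parser like XML::LibXML::Reader needs to signal both the start and the end of an element, so this makes sense. Imagine ->copyCurrentNode weren't used: without an end-tag event, there would be no way to tell where the element's content stops.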
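To see this for yourself, here is a minimal sketch that records the node type of every pattern match instead of copying the node. It assumes the sample document from the question, inlined as a string so no input file is needed. The %label table and @events array are names made up for this illustration.

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader qw( XML_READER_TYPE_ELEMENT XML_READER_TYPE_END_ELEMENT );

# Inline copy of the sample document from the question.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
   <channel>
      <title>Blog</title>
      <item><title>Title 1</title></item>
      <item><title>Title 2</title></item>
   </channel>
</rss>
XML

# Hypothetical lookup table, used only to print readable labels.
my %label = (
    XML_READER_TYPE_ELEMENT()     => 'start tag',
    XML_READER_TYPE_END_ELEMENT() => 'end tag',
);

my $reader  = XML::LibXML::Reader->new( string => $xml ) or die;
my $pattern = XML::LibXML::Pattern->new('/rss/channel/item');

my @events;    # node type of every match, in document order
while ( $reader->nextPatternMatch($pattern) ) {
    my $type = $reader->nodeType;
    push @events, $type;
    printf "%-9s <%s> at depth %d\n",
        $label{$type} // "type $type", $reader->name, $reader->depth;
}

Each <item> should show up twice, once as a start tag and once as an end tag, which lines up with the four MATCH blocks in the question.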

However, we do use ->copyCurrentNode, which already returns the whole element when called at its start tag, so we neither need nor want the end-tag matches. We simply skip them using one of the following:

next if $reader->nodeType != XML_READER_TYPE_ELEMENT;

or

next if $reader->nodeType == XML_READER_TYPE_END_ELEMENT;

Demo:

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader qw( XML_READER_TYPE_ELEMENT );

my $filename = $ARGV[0];

my $reader = XML::LibXML::Reader->new( location => $filename );

my $entry_pattern = 'XML::LibXML::Pattern'->new( '/rss/channel/item' );

while ( $reader->nextPatternMatch($entry_pattern) ) {
    next if $reader->nodeType != XML_READER_TYPE_ELEMENT;

    say "MATCH";
    my $item = $reader->copyCurrentNode(1);
    say $item;
    say 'Title: ', $item->findvalue( './title' );
    say "";
}

Output:

MATCH
<item><title>Title 1</title></item>
Title: Title 1

MATCH
<item><title>Title 2</title></item>
Title: Title 2
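
For comparison, here is a rough sketch that runs both guards over the same input. It also shows that the XML_READER_TYPE_END_ELEMENT variant needs that constant in the import list. The item_titles helper and its $skip callback are hypothetical names for this illustration, not part of XML::LibXML, and the XML is a compact inline version of the sample document.

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader qw( XML_READER_TYPE_ELEMENT XML_READER_TYPE_END_ELEMENT );

# Hypothetical helper: return the title of every /rss/channel/item in $xml.
# $skip is a code ref that receives the reader and returns true when the
# current match should be ignored.
sub item_titles {
    my ( $xml, $skip ) = @_;

    my $reader  = XML::LibXML::Reader->new( string => $xml ) or die;
    my $pattern = XML::LibXML::Pattern->new('/rss/channel/item');

    my @titles;
    while ( $reader->nextPatternMatch($pattern) ) {
        next if $skip->($reader);
        my $item = $reader->copyCurrentNode(1);    # deep copy of the <item>
        push @titles, $item->findvalue('./title');
    }
    return @titles;
}

my $xml = '<rss version="2.0"><channel><title>Blog</title>'
        . '<item><title>Title 1</title></item>'
        . '<item><title>Title 2</title></item>'
        . '</channel></rss>';

# Guard 1: keep only start tags.  Guard 2: drop only end tags.
my @by_start = item_titles( $xml, sub { $_[0]->nodeType != XML_READER_TYPE_ELEMENT } );
my @by_end   = item_titles( $xml, sub { $_[0]->nodeType == XML_READER_TYPE_END_ELEMENT } );

say "not-an-element guard: @by_start";
say "end-element guard:    @by_end";

With this pattern only start and end tags of <item> can match, so the two guards should return the same titles. The first is the stricter choice if the pattern is later changed to something that can also match other node types.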
