I'm trying to parse a WordPress blog export. I've used some XML::LibXML code successfully on a sample export of 3 blog entries, but I decided to try XML::LibXML::Reader instead, since I expect to have to parse a very large file and I'm concerned about running out of memory.
However, I'm getting some extra blank nodes.
The problem can be demonstrated using the following code and XML document:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use XML::LibXML::Reader;
my $filename = $ARGV[0];
my $reader = XML::LibXML::Reader->new(location => $filename) or die;
my $entry_pattern = 'XML::LibXML::Pattern'->new('/rss/channel/item');
while ($reader->nextPatternMatch($entry_pattern)) {
    say "MATCH";
    my $item = $reader->copyCurrentNode(1);
    say $item;
    say 'Title: ', $item->findvalue('./title');
    say "";
}
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
<title>Blog</title>
<item><title>Title 1</title></item>
<item><title>Title 2</title></item>
</channel>
</rss>
The output obtained:
MATCH
<item><title>Title 1</title></item>
Title: Title 1
MATCH
<item/>
Title:
MATCH
<item><title>Title 2</title></item>
Title: Title 2
MATCH
<item/>
Title:
Note the extra <item/> matches. Where do these come from? How can I avoid them?
CodePudding user response:
What seems to be happening is that the end tag is being matched as well. A pull/stream parser like XML::LibXML::Reader needs to signal both the start and the end of each element, so this makes sense. Imagine ->copyCurrentNode weren't used: the end-tag event is how you would know that the element's content is finished.
However, we do use ->copyCurrentNode, so we neither need nor want the end-tag matches. We simply have to skip them using the following:
next if $reader->nodeType != XML_READER_TYPE_ELEMENT;
or
next if $reader->nodeType == XML_READER_TYPE_END_ELEMENT;
Demo:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use XML::LibXML::Reader qw( XML_READER_TYPE_ELEMENT );
my $filename = $ARGV[0];
my $reader = XML::LibXML::Reader->new( location => $filename );
my $entry_pattern = 'XML::LibXML::Pattern'->new( '/rss/channel/item' );
while ( $reader->nextPatternMatch($entry_pattern) ) {
    # The pattern also matches the closing </item> tag; only handle start tags.
    next if $reader->nodeType != XML_READER_TYPE_ELEMENT;
    say "MATCH";
    my $item = $reader->copyCurrentNode(1);
    say $item;
    say 'Title: ', $item->findvalue( './title' );
    say "";
}
Output:
MATCH
<item><title>Title 1</title></item>
Title: Title 1
MATCH
<item><title>Title 2</title></item>
Title: Title 2
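If you want to check for yourself that the extra matches are end tags, here is a small diagnostic sketch. It assumes the same sample document as in the question, inlined as a string so the script needs no input file. The names @matches and %type_name are just illustrative helpers, not part of the module.

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use XML::LibXML::Reader qw( XML_READER_TYPE_ELEMENT XML_READER_TYPE_END_ELEMENT );

# Sample document from the question, inlined so the script is self-contained.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
<title>Blog</title>
<item><title>Title 1</title></item>
<item><title>Title 2</title></item>
</channel>
</rss>
XML

# Readable labels for the two node types we expect to see (illustrative helper).
my %type_name = (
    XML_READER_TYPE_ELEMENT()     => 'start tag',
    XML_READER_TYPE_END_ELEMENT() => 'end tag',
);

my $reader  = XML::LibXML::Reader->new( string => $xml ) or die "Cannot create reader\n";
my $pattern = XML::LibXML::Pattern->new('/rss/channel/item');

# Record the node type of every pattern match instead of copying the node.
my @matches;
while ( $reader->nextPatternMatch($pattern) ) {
    my $type = $reader->nodeType;
    push @matches, $type;
    say $reader->name, ': ', $type_name{$type} // "node type $type";
}
```

If the explanation above is right, each item should be reported twice, first as a start tag and then as an end tag, which accounts for the empty <item/> copies in the question.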