Perl XML::LibXML Get data outside of a tag

Time:12-30

This is a follow-up to my last question (Perl XML::LibXML Getting info from specific nodes).

Given the following XML data, I cannot figure out how to get the data that appears after the <tab/> tag (which has no ending tag) without also getting all of the data from the child nodes within the section. See below for more specifics:

XML Sample:

<title number="3">
<catchline>Uniform Agricultural Cooperative Association Act</catchline>
<chapter number="3-1">
<catchline>
General Provisions Relating to Agricultural Cooperative Associations
</catchline>
<section number="3-1-1">
<histories>
<history>
Amended by Chapter
<modchap sess="2010GS">378</modchap>
, 2010 General Session
</history>
<modyear>2010</modyear>
</histories>
<catchline>Declaration of policy.</catchline>
<tab/>
It is the declared policy of this state, as one means of improving the economic position of agriculture, to encourage the organization of producers of agricultural products into effective associations under the control of such producers, and to that end this act shall be liberally construed. THIS IS THE DATA THAT I WANT TO GET
</section>
<section number="3-1-1.1">
<histories>
<history>
Amended by Chapter
<modchap sess="1996GS">79</modchap>
, 1996 General Session
</history>
<modyear>1996</modyear>
</histories>
<catchline>General corporation laws do not apply.</catchline>
<tab/>
<xref depth="1" refnumber="16-10a" start="0">
Title 16, Chapter 10a, Utah Revised Business Corporation Act
</xref>
, does not apply to domestic or foreign corporations governed by this chapter, except as specifically provided in Sections
<xref depth="3" refnumber="3-1-13.4" start="0">3-1-13.4</xref>
,
<xref depth="3" refnumber="3-1-13.7" start="0">3-1-13.7</xref>
, and
<xref depth="3" refnumber="3-1-16.1" start="0">3-1-16.1</xref>
.
</section>
</chapter>
</title>

Here is my current Perl script:

#!/usr/bin/perl -w


use XML::LibXML;


my $dom = XML::LibXML->load_xml(location => "file.xml");
my $titleName = $dom->findvalue('/title/catchline');
print "Title $titleName\n";

my @chapters = $dom->findnodes('/title/chapter');

for my $chapter (@chapters) {
        my $chapterNo = $chapter->getAttribute('number');
        my $chapterName = $chapter->findvalue('catchline');
        print " Chapter #$chapterNo - $chapterName\n";

        my @sections = $chapter->findnodes('section');

        for my $section (@sections) {
                my $sectionNo = $section->getAttribute('number');
                my $sectionName = $section->findvalue('catchline');
                my $sectionData = $section->textContent;
                print "  Section #$sectionNo - $sectionName\nSECDATA: $sectionData\n\n";

        }
}

This runs, but, probably because it is exactly what I asked for, it prints everything in the <section> for the $sectionData variable.

What I am trying to do is get just the data after the <tab/> tag, without anything that sits inside another tag, such as the child tags <histories>, <history>, <xref>, etc.

So for instance, the string:

, does not apply to domestic or foreign corporations governed by this chapter, except as specifically provided in Sections

is not contained within any particular tag. How do I get just that data?

The current output is:

Title Uniform Agricultural Cooperative Association Act
 Chapter #3-1 - 
General Provisions Relating to Agricultural Cooperative Associations

  Section #3-1-1 - Declaration of policy.
SECDATA: 


Amended by Chapter
378
, 2010 General Session

2010

Declaration of policy.

It is the declared policy of this state, as one means of improving the economic position of agriculture, to encourage the organization of producers of agricultural products into effective associations under the control of such producers, and to that end this act shall be liberally construed.


  Section #3-1-1.1 - General corporation laws do not apply.
SECDATA: 


Amended by Chapter
79
, 1996 General Session

1996

General corporation laws do not apply.


Title 16, Chapter 10a, Utah Revised Business Corporation Act

, does not apply to domestic or foreign corporations governed by this chapter, except as specifically provided in Sections
3-1-13.4
,
3-1-13.7
, and
3-1-16.1
.

But what I am looking for is something more like:

Title Uniform Agricultural Cooperative Association Act
 Chapter #3-1 - 
General Provisions Relating to Agricultural Cooperative Associations

  Section #3-1-1 - Declaration of policy.
SECDATA: 
It is the declared policy of this state, as one means of improving the economic position of agriculture, to encourage the organization of producers of agricultural products into effective associations under the control of such producers, and to that end this act shall be liberally construed.


  Section #3-1-1.1 - General corporation laws do not apply.
SECDATA: 
, does not apply to domestic or foreign corporations governed by this chapter, except as specifically provided in Sections

Answer:

If you wanted the text nodes that followed the tab element, you could use

my @post_tab_text_nodes = $section_node->findnodes('tab/following-sibling::text()');
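
As a minimal sketch of what that kind of expression selects, here it is run against a cut-down copy of section 3-1-1.1 from the question. The $xml string, the @pieces array, and the whitespace trimming are assumptions made for illustration; they are not part of the original script.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down copy of section 3-1-1.1 from the question (assumed input).
my $xml = <<'XML';
<section number="3-1-1.1">
<catchline>General corporation laws do not apply.</catchline>
<tab/>
<xref depth="1" refnumber="16-10a" start="0">Title 16, Chapter 10a</xref>
, does not apply to domestic or foreign corporations, except as provided in Sections
<xref depth="3" refnumber="3-1-13.4" start="0">3-1-13.4</xref>
.
</section>
XML

my $section_node = XML::LibXML->load_xml(string => $xml)->documentElement;

# Text nodes that are siblings of <tab/> and come after it.
# Text inside <xref> is a child of <xref>, not a sibling of <tab/>, so it is skipped.
my @post_tab_text_nodes = $section_node->findnodes('tab/following-sibling::text()');

# Drop whitespace-only nodes and trim the rest.
my @pieces;
for my $node (@post_tab_text_nodes) {
    my $text = $node->data;
    $text =~ s/^\s+|\s+$//g;
    push @pieces, $text if length $text;
}

print "$_\n" for @pieces;
```

Only the bare text between the elements comes back; the cross-reference text inside each <xref> is left out.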

But what you want is a lot more complicated than that.

use List::Util  qw( first );
use XML::LibXML qw( XML_ELEMENT_NODE );

my @child_nodes = $section_node->childNodes();

my $tab_node_idx =
   first {
      my $node = $child_nodes[$_];
      (  $node->nodeType() == XML_ELEMENT_NODE
      && !defined( $node->namespaceURI() )
      && $node->nodeName() eq 'tab'
      )
   }
      0..$#child_nodes;

my @post_tab_children =
   defined($tab_node_idx)
      ? @child_nodes[ $tab_node_idx + 1 .. $#child_nodes ]
      : ();

Rendering the resulting nodes as text is an exercise left to the user. You appear to have a mix of element nodes (XML_ELEMENT_NODE) and text nodes (XML_TEXT_NODE), which can be differentiated using $node->nodeType.
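
One possible way to do that rendering is sketched below. The helper name post_tab_text and its $with_elements flag are hypothetical, and the whitespace clean-up at the end is an assumption about how the output should look rather than anything the question specifies.

```perl
use strict;
use warnings;
use List::Util  qw( first );
use XML::LibXML qw( XML_ELEMENT_NODE XML_TEXT_NODE );

# Hypothetical helper: returns the text that follows <tab/> in a section.
# With $with_elements true, the text inside child elements such as <xref>
# is kept; otherwise only the bare text nodes are used.
sub post_tab_text {
    my ($section_node, $with_elements) = @_;

    my @child_nodes = $section_node->childNodes();
    my $tab_node_idx = first {
        my $node = $child_nodes[$_];
        (  $node->nodeType() == XML_ELEMENT_NODE
        && !defined( $node->namespaceURI() )
        && $node->nodeName() eq 'tab'
        )
    } 0 .. $#child_nodes;
    return '' if !defined $tab_node_idx;

    my $text = '';
    for my $node (@child_nodes[ $tab_node_idx + 1 .. $#child_nodes ]) {
        my $type = $node->nodeType();
        if ($type == XML_TEXT_NODE) {
            $text .= $node->data;
        }
        elsif ($type == XML_ELEMENT_NODE && $with_elements) {
            $text .= $node->textContent;
        }
    }

    # Collapse runs of whitespace, trim, and remove the space left before punctuation.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    $text =~ s/ ([,.])/$1/g;
    return $text;
}

# Cut-down copy of section 3-1-1.1 from the question (assumed input).
my $dom = XML::LibXML->load_xml(string => <<'XML');
<section number="3-1-1.1">
<catchline>General corporation laws do not apply.</catchline>
<tab/>
<xref depth="1" refnumber="16-10a" start="0">
Title 16, Chapter 10a
</xref>
, does not apply, except as provided in Sections
<xref depth="3" refnumber="3-1-13.4" start="0">3-1-13.4</xref>
.
</section>
XML

my $full = post_tab_text($dom->documentElement, 1);
my $bare = post_tab_text($dom->documentElement, 0);
print "$full\n$bare\n";
```

With the flag set, the cross-references are folded back into a readable sentence; with it cleared, only the text that sits directly inside <section> after <tab/> survives, which is closer to the output asked for in the question.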
