I need help extracting data from XML/text files. My files contain a large amount of data in the format shown below.
<authorList>
<author>
<fullName>Oliver LA</fullName>
<firstName>L A</firstName>
<lastName>Oliver</lastName>
<initials>LA</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>University of Liverpool, Liverpool, UK. Electronic address: [email protected].</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Hutton DP</fullName>
<firstName>D P</firstName>
<lastName>Hutton</lastName>
<initials>DP</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>North West Radiotherapy Operational Delivery Network, The Christie Hospital, Manchester, UK; University of Liverpool, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Hall T</fullName>
<firstName>T</firstName>
<lastName>Hall</lastName>
<initials>T</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>North West Radiotherapy Operational Delivery Network, The Christie Hospital, Manchester, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Cain M</fullName>
<firstName>M</firstName>
<lastName>Cain</lastName>
<initials>M</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Bates M</fullName>
<firstName>M</firstName>
<lastName>Bates</lastName>
<initials>M</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>East of England Radiotherapy Network, Norfolk &amp; Norwich University Hospital, Norwich, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Cree A</fullName>
<firstName>A</firstName>
<lastName>Cree</lastName>
<initials>A</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Mullen E</fullName>
<firstName>E</firstName>
<lastName>Mullen</lastName>
<initials>E</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
</authorList>
I need the output in the format Email,firstName,lastName,affiliation, and it should be exported to a text file.
I have written the Perl code below.
#!/usr/bin/perl
use strict;
use warnings;
open(FILEHANDLE, "<data.xml")|| die "Can't open";
my @line;
my @affi;
my @lines;
my $ct =1 ;
print "Enter the start position:-";
my $start= <STDIN>;
print "Enter the end position:-";
my $end = <STDIN>;
print "Processing your data...\n";
my $i =0;
my $t =0;
while(<FILEHANDLE>)
{
if($ct>$end)
{
close(FILEHANDLE);
exit;
}
if($ct>=$start)
{
$lines[$t] = $_;
$t++;
}
if($ct == $end)
{
my $i = 0;
my $j = 0;
my @last;
my @first;
my $l = @lines;
my $s = 0;
while($j<$l)
{
if ($lines[$j] =~m/@/)
{
$line[$i] = $lines[$j];
$s = $j-3;
$first[$i]=$lines[$s];
$s--;
$last[$i] = $lines[$s];
#$j = $j + 3;
#$last[$i]= $lines[$j];
#$j++;
#$first[$i] = $lines[$j];
$i++;
}
$j++;
}
my $k = 0;
foreach(@line)
{
$line[$k] =~ s/<.*>(.* )(.*@.*)<.*>/$2/;
$affi[$k] = $1;
$line[$k] = $2;
$line[$k] =~ s/\.$//;
$k++;
}
my $u = 0;
foreach(@first)
{
$first[$u] =~s/<firstName>(.*)<.*>/$1/;
$first[$u]=$l;
$u++;
}
my $m = 0;
foreach(@last)
{
$last[$m] =~s/<lastName>(.*)<.*>/$1/;
$last[$m] = $1;
$m++;
}
my $q=@line;
open(FILE,">RAVI.txt")|| die "can't open";
my $p;
for($p =0; $p<$q; $p++)
{
print FILE "$line[$p],$first[$p],$last[$p],$affi[$p]\n";
}
close(FILE);
}
$ct++;
}
With this code I get output in the format email, ,lastName,affiliation.
I cannot get the firstName from the given data. I am new to Perl. Please help me fix the mistakes in my code. Thank you in advance.
CodePudding user response:
As I said in my comment, it is better to use an established XML parser. One of them is XML::XPath:
#!/usr/bin/perl
use strict; use warnings;
use feature qw/say/;
use XML::XPath;
my $file = shift;
my $xp = XML::XPath->new(filename => $file)
or die $!;
my $nodeset = $xp->find('/authorList//author');
foreach my $node ($nodeset->get_nodelist) {
my @contact;
push @contact, $node->findvalue('./firstName');
push @contact, $node->findvalue('./lastName');
$_ = $node->findvalue('.//authorAffiliation/affiliation');
push @contact, $& if m/\b\S+\@\S+/;
say join ", ", @contact;
}
Output
L A, Oliver, [email protected].
D P, Hutton
T, Hall
M, Cain
M, Bates
A, Cree
E, Mullen
Usage
./XML::XPath.pl file.xml | tee new_file.txt
CodePudding user response:
Your mistake was to try and write your own XML parser. That's a very hard thing to get right. Far better to use one that has already been written.
I always reach for XML::LibXML (it has terrible documentation, but there's a great tutorial online).
A first attempt at your program would look something like this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use XML::LibXML;
my $infile = shift
or die "Usage: $0 xml_file\n";
my $dom = XML::LibXML->load_xml(location => $infile);
my @nodes = qw[ firstName lastName
authorAffiliationDetailsList/authorAffiliation/affiliation ];
for my $author ($dom->findnodes('//author')) {
my @data = map { $author->findvalue($_) } @nodes;
say join ',', map { qq["$_"] } @data;
}
Note that I've put all of your output into quotes - that's because the affiliation node contains embedded commas.
In reality, you'd need to process the affiliation data a little more to extract the email address. But I hope this gets you most of the way to a solution.
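As a rough sketch of that extra processing step, the snippet below pulls the email address out of an affiliation string with a regular expression. The helper name split_affiliation, the assumption that the address is preceded by an optional "Electronic address:" label, and the sample address someone@example.org are all illustrative and not taken from the data or from either answer. It treats any run of non-space characters around an "@" as an email, which is a simplification.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of either answer above): split an
# affiliation string into (email, affiliation without the email).
sub split_affiliation {
    my ($text) = @_;
    my $email = '';
    # Assumption: an email is a run of non-space characters around '@',
    # optionally introduced by the label "Electronic address:".
    if ($text =~ s/\s*(?:Electronic address:\s*)?(\S+\@\S+)//) {
        $email = $1;
        $email =~ s/[.,;]+$//;    # drop trailing sentence punctuation
    }
    $text =~ s/\s+$//;            # trim whitespace left behind
    return ($email, $text);
}

# Placeholder address for illustration; real data supplies its own.
my ($email, $affiliation) = split_affiliation(
    'University of Liverpool, Liverpool, UK. Electronic address: someone@example.org.'
);
print qq["$email","$affiliation"\n];
```

Inside the XML::LibXML loop, one could call split_affiliation on the value returned for the affiliation node and then print the email first, followed by firstName, lastName and the cleaned affiliation, to match the column order asked for in the question.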