How to match multiple items in Perl

Time:06-12

my $text = '<span>by <small  itemprop="author">J.K. Rowling</small><span>by <small  itemprop="author">J.K. Rowling</small><span>by <small  itemprop="author">J.K. Rowling</small>';


if ($text =~ m/<span>by <small  itemprop="author">(.+?)<\/small>/ig){
    $author = $1;
    $authorcount{$author} += 1;
}

$authorcounttxt = "authorcount.txt";
open (OUTPUT3, ">$authorcounttxt");
foreach $author (sort { $authorcount{$b} <=> $authorcount{$a} } keys %authorcount){
    print OUTPUT3 ("$author\t\t$authorcount{$author}\n");
}
close (OUTPUT3);

The desired output is:

J.K. Rowling 3

However I am only getting:

J.K. Rowling 1

CodePudding user response:

if ($text =~ m/.../ig){
     $author = $1;
     $authorcount{$author} += 1;
}

This is an if statement, which means that the inner block will be entered at most once, i.e. only for the first match. You likely meant to use a while statement, which enters the inner block once for each match:

while ($text =~ m/.../ig){
     $author = $1;
     $authorcount{$author} += 1;
}
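
To see the difference side by side, here is a minimal self-contained sketch. It assumes the capture group in the question is (.+?); the hash names %if_count and %while_count are only for illustration and do not appear in the original code.

```perl
use strict;
use warnings;

# The sample markup from the question, repeated three times
my $text = '<span>by <small  itemprop="author">J.K. Rowling</small>' x 3;
my $re   = qr/<span>by <small  itemprop="author">(.+?)<\/small>/i;

my (%if_count, %while_count);

# if: the block runs at most once, so only the first match is counted
if ($text =~ /$re/g) {
    $if_count{$1} += 1;
}

# The /g match above left a search position behind; reset it so the
# while loop below starts from the beginning of the string again
pos($text) = undef;

# while: in scalar context, /g resumes after the previous match on each
# iteration, so the block runs once per match
while ($text =~ /$re/g) {
    $while_count{$1} += 1;
}

print "if:    $if_count{'J.K. Rowling'}\n";     # prints "if:    1"
print "while: $while_count{'J.K. Rowling'}\n";  # prints "while: 3"
```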

CodePudding user response:

Replace your if with a while to iterate through all of the matches of your regex match instead of only the first one:

while ($text =~ m/<span>by <small  itemprop="author">(.+?)<\/small>/ig){
  $author = $1;
  $authorcount{$author} += 1;
}

Also obligatory note: parsing HTML with regexen is fraught with peril. Consider using a module that can properly parse HTML, Mojo::DOM for example.
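
For what it's worth, a rough sketch of the Mojo::DOM route could look like the following. It assumes the Mojolicious distribution (which provides Mojo::DOM) is installed from CPAN, and that the authors are always in small elements carrying itemprop="author"; the CSS selector is my guess at what fits the sample markup, not something taken from the question.

```perl
use strict;
use warnings;
use Mojo::DOM;   # not a core module; ships with Mojolicious

my $html = '<span>by <small  itemprop="author">J.K. Rowling</small>' x 3;

my %authorcount;

# find() takes a CSS selector and returns a collection of matching
# elements; map('text') calls ->text on each one, and each() returns
# the results as a plain list
$authorcount{$_}++
    for Mojo::DOM->new($html)
                 ->find('small[itemprop="author"]')
                 ->map('text')
                 ->each;

print "$_\t$authorcount{$_}\n" for sort keys %authorcount;
```

The selector keeps working if attribute order, whitespace or quoting in the markup changes, which is where a hand-written regex tends to break.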

CodePudding user response:

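As previous posters already indicated, the issue is in if ( $text =~ /.../gi ): the condition is evaluated once, so the block is executed at most once, for the first match only.

You want to process every match. A for loop evaluates the match in list context, where /g returns all captured strings at once; a while loop evaluates it in scalar context, where /g steps through the matches one at a time.

The following code snippet demonstrates one of many possible approaches.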

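use strict;
use warnings;
use feature 'say';

my (%authors, $fname, $text, $re);

$fname = 'authorcount.txt';
$text  = '<span>by <small  itemprop="author">J.K. Rowling</small><span>by <small  itemprop="author">J.K. Rowling</small><span>by <small  itemprop="author">J.K. Rowling</small>';
# /i goes on the qr// itself: flags on the outer match do not reach into
# an interpolated compiled pattern
$re    = qr/<span>by <small  itemprop="author">(.*?)<\/small>/i;

# In list context, /g returns every captured author; $_ is the current
# capture ($1 would only hold the capture from the last match)
$authors{$_}++ for $text =~ /$re/g;

# Three-argument open; $! reports why the open failed
open my $fh, '>', $fname
    or die "Can't open $fname: $!";

say $fh "$_ $authors{$_}" for sort keys %authors;

close $fh;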
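
To illustrate the list-context behaviour with more than one author, here is a small sketch. The second author and the way the sample string is built are made up for the demonstration and are not part of the question; the output is sorted by count, as in the question's own code.

```perl
use strict;
use warnings;

# Hypothetical sample data: two authors, one of them appearing twice
my $text = '';
$text .= qq{<span>by <small  itemprop="author">$_</small>}
    for 'J.K. Rowling', 'Stephen King', 'J.K. Rowling';

my $re = qr/<span>by <small  itemprop="author">(.*?)<\/small>/i;

# List context: with /g and one capture group, the match returns all
# captured strings at once
my @found = $text =~ /$re/g;

my %authors;
$authors{$_}++ for @found;   # $_ is the current capture, one per match

# Most frequent author first, ties broken alphabetically
for my $name (sort { $authors{$b} <=> $authors{$a} or $a cmp $b } keys %authors) {
    print "$name\t$authors{$name}\n";
}
```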

NOTE: this code works for your example $text = '...'. If you intend to process complex HTML files, then Mojo::DOM is the right tool for the job.
