I'm reading an HTML file and trying to extract some information from it. I've tried HTML parsers, but can't figure out how to use them to get the key text out. My original script reads the HTML file; this version is a minimal working example for Stack Overflow purposes.
#!/usr/bin/env perl
use 5.036;
use warnings FATAL => 'all';
use autodie ':default';
use Devel::Confess 'color';
sub regex_test ( $string, $regex ) {
    if ($string =~ m/$regex/s) {
        say "$string matches $regex";
    } else {
        say "$string doesn't match $regex";
    }
}
# the HTML text is $s
my $s = ' rs577952184 was merged into
<a target="_blank"
href="rs59222162">rs59222162</a>
';
regex_test ( $s, 'rs\d+ was merged into.*\<a target="_blank".+href="rs(\d+)/');
However, this doesn't match.
I think the problem is that the newline after "merged into" isn't matching.
How can I alter the above regex to match $s?
CodePudding user response:
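The problem is the trailing / character in the $regex, which should either be omitted or changed to ". In $s the digits after href="rs are followed by ", not /, so the pattern can never match as written. The newline is not the culprit: with the /s modifier, . also matches newlines.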
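As a minimal sketch of that fix, the snippet below reuses the string from the question and drops the trailing / from the pattern. It assumes the pattern is meant to use the \d+ and .+ quantifiers; the printed message is only illustrative.

```perl
use strict;
use warnings;
use feature 'say';

# Same sample HTML fragment as in the question.
my $s = ' rs577952184 was merged into
<a target="_blank"
href="rs59222162">rs59222162</a>
';

# The question's pattern with the stray trailing "/" removed.
# Single quotes keep the backslashes intact for the regex engine.
my $regex = 'rs\d+ was merged into.*\<a target="_blank".+href="rs(\d+)';

# /s lets "." match newlines, so ".*" and ".+" can span the line breaks.
if ($s =~ m/$regex/s) {
    say "matched, merged-into ID is rs$1";
} else {
    say "no match";
}
```

With the slash gone, the capture group holds the digits of the ID that follows href="rs.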
CodePudding user response:
use strict;
use warnings;
use feature 'say';
my $s = ' rs577952184 was merged into
<a target="_blank"
href="rs59222162">rs59222162</a>
';
my $re = qr/rs\d+ was merged into\s+<a target="_blank"\s+href="rs(\d+)">rs\d+<\/a>/;
regex_test($s,$re);
exit 0;
sub regex_test {
    my $string = shift;
    my $regex = shift;
    say $string =~ m/$regex/s
        ? "$string matches $regex"
        : "$string doesn't match $regex";
}
Output
rs577952184 was merged into
<a target="_blank"
href="rs59222162">rs59222162</a>
matches (?^:rs\d+ was merged into\s+<a target="_blank"\s+href="rs(\d+)">rs\d+</a>)
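Since the goal in the question is to get the key text out, here is a hedged sketch of one way to capture both IDs instead of only testing for a match. It is a variation on the pattern above, not part of the original answer: the variable names $old and $new are made up for illustration, and the /x layout is just one possible style.

```perl
use strict;
use warnings;
use feature 'say';

# Same sample HTML fragment as in the question.
my $s = ' rs577952184 was merged into
<a target="_blank"
href="rs59222162">rs59222162</a>
';

# /x ignores literal whitespace in the pattern, so every gap in the
# input has to be spelled out with \s+ (which also matches newlines).
my $re = qr{
    rs(\d+) \s+ was \s+ merged \s+ into \s+   # ID that was merged away
    <a \s+ target="_blank" \s+
    href="rs(\d+)"                            # ID it was merged into
}x;

# In list context a successful match returns the captured groups.
my ($old, $new) = $s =~ $re;

if (defined $new) {
    say "rs$old was merged into rs$new";
} else {
    say "no match";
}
```

Using \s+ between the words, rather than literal spaces, keeps the pattern working if the HTML is re-wrapped at a different point.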