I have a longer list whose elements are multi-character symbols, for example:
@c = qw(iim v7 v7 iM iv7 iM **im iv7 iv7 bviiM** im biio iim bviim biiM biim bviM bviM ivm iih v7 v7 v7 iiim iiih vi7 iim v7 v7 iM iv7 iM **im iv7 bviiM** im biio iim bviim bviim iiio iim v7 v7 v7 vm i7 ivM iiih vi7);
I'd like to find the indices in this list that match sub-lists of the type S1 S2 S3 , where the " " means match one or more times. So, for example, the sub-list pattern (im iv7 bviiM)
would match both (im iv7 iv7 bviiM)
and (im iv7 bviiM)
as highlighted in bold above. The code would provide the indices 6, 7, 8, 9 for the first match and 32, 33, 34 for the second.
On the surface, this doesn't seem like it should be difficult, and I've tried implementing this with a variety of methods, including regexes, but so far it has beaten me. If there's a simple way to do this I'd be grateful for any help.
CodePudding user response:
Do you mean something like this?
#! /usr/bin/env perl
use warnings;
use strict;
use utf8;
use feature qw<say>;
use List::Util qw<any>;
my @sub_pat = qw(im iv7 bviiM);
my @c =
qw(
iim v7 v7 iM
iv7 iM im iv7
iv7 bviiM im biio
iim bviim biiM biim
bviM bviM ivm iih
v7 v7 v7 iiim
iiih vi7 iim v7
v7 iM iv7 iM
im iv7 bviiM im
biio iim bviim bviim
iiio iim v7
v7 v7 vm i7
ivM iiih vi7
);
my %ans = ();
while (my ($i, $k) = each @c) {
push @{$ans{$k}}, $i if any {$_ eq $k} @sub_pat;
}
while (my ($k, $v) = each %ans) {
say "$k @{$v}";
}
exit(0);
CodePudding user response:
An interesting problem, because repeated elements need also be matched by items from the given sub-sequence while the order need be maintained.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);
my @words = qw(iim v7 v7 iM iv7 iM im iv7 iv7 bviiM im biio iim bviim biiM
biim bviM bviM ivm iih v7 v7 v7 iiim iiih vi7 iim v7 v7 iM iv7 iM im
iv7 bviiM im biio iim bviim bviim iiio iim v7 v7 v7 vm i7 ivM iiih vi7);
my @subseq = qw(im iv7 bviiM);
my (@all_seqs, @mi);
my $s = 0;
for my $i (0 .. $#words) {
if ($words[$i] eq $subseq[$s]) { # first in @subseq or repeated from @words
push @mi, $i;
}
elsif (@mi and $s == @subseq-1) { # done, exhausted @subseq
push @all_seqs, [ @mi ];
$s = 0;
@mi = ();
}
elsif (@mi and $words[$i] eq $subseq[ $s]) { # next in @subseq
push @mi, $i;
}
elsif (@mi) { # failed to match all from @subseq
$s = 0;
@mi = ();
}
}
dd \@all_seqs;
The @mi
is included in all tests after the first one so that they are done only when something has already been matched.
Prints
[[6 .. 9], [32, 33, 34]]
Uncomment printing lines to track its operation. This has been tested beyond the basic run above but not well enough.
Or, concatenate all words into a string and match the sub-sequence, concatenated into a pattern, by regex; then it's easy to take care of possible repetitions.
In order to also scoop up indices from the original array in a match I prepend each word by __INDEX__
.
# Same @words and @subseq from above
my $w = join '', map { '__'.$_.'__' . $words[$_] } 0.. $#words;
my $patt = '(' .
join('', map { '(?:' . '__[0-9] __' . quotemeta($_) . ') ' } @subseq) . ')';
my @seqs = $w =~ /$patt/g;
my @seqs_idx = map { [ /__([0-9] )__/g ] } @seqs;
dd \@seqs_idx;