Home > Software engineering >  How to find the indices of sub-list patterns in Perl
How to find the indices of sub-list patterns in Perl

Time:12-18

I have a longer list whose elements are multi-character symbols, for example:

@c = qw(iim v7 v7 iM iv7 iM **im iv7 iv7 bviiM** im biio iim bviim biiM biim bviM bviM ivm iih v7 v7 v7 iiim iiih vi7 iim v7 v7 iM iv7 iM **im iv7 bviiM** im biio iim bviim bviim iiio iim v7 v7 v7 vm i7 ivM iiih vi7); 

I'd like to find the indices in this list that match sub-lists of the type S1 S2 S3 , where the " " means match one or more times. So, for example, the sub-list pattern (im iv7 bviiM) would match both (im iv7 iv7 bviiM) and (im iv7 bviiM) as highlighted in bold above. The code would provide the indices 6, 7, 8, 9 for the first match and 32, 33, 34 for the second.

On the surface, this doesn't seem like it should be difficult, and I've tried implementing this with a variety of methods, including regexes, but so far it has beaten me. If there's a simple way to do this I'd be grateful for any help.

CodePudding user response:

Do you mean something like this?

#! /usr/bin/env perl

use warnings;
use strict;
use utf8;
use feature qw<say>;
use List::Util qw<any>;

my @sub_pat = qw(im iv7 bviiM);
my @c =
    qw(
    iim v7 v7 iM 
    iv7 iM im iv7 
    iv7 bviiM im biio 
    iim bviim biiM biim 
    bviM bviM ivm iih
    v7 v7 v7 iiim 
    iiih vi7 iim v7 
    v7 iM iv7 iM 
    im iv7 bviiM im 
    biio iim bviim bviim 
    iiio iim v7 
    v7 v7 vm i7 
    ivM iiih vi7
    );

my %ans = ();

while (my ($i, $k) = each @c) {
    push @{$ans{$k}}, $i if any {$_ eq $k} @sub_pat;
}

while (my ($k, $v) = each %ans) {
    say "$k @{$v}";
}

exit(0);

CodePudding user response:

An interesting problem, because repeated elements need also be matched by items from the given sub-sequence while the order need be maintained.

use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);

my @words = qw(iim v7 v7 iM iv7 iM im iv7 iv7 bviiM im biio iim bviim biiM
    biim bviM bviM ivm iih v7 v7 v7 iiim iiih vi7 iim v7 v7 iM iv7 iM im 
    iv7 bviiM im biio iim bviim bviim iiio iim v7 v7 v7 vm i7 ivM iiih vi7);

my @subseq = qw(im iv7 bviiM);

my (@all_seqs, @mi);
my $s = 0;

for my $i (0 .. $#words) { 
    if ($words[$i] eq $subseq[$s]) {  # first in @subseq or repeated from @words
        push @mi, $i;
    }   
    elsif (@mi and $s == @subseq-1) { # done, exhausted @subseq
        push @all_seqs, [ @mi ];  
        $s = 0;
        @mi = (); 
    }   
    elsif (@mi and $words[$i] eq $subseq[  $s]) { # next in @subseq
        push @mi, $i;
    }
    elsif (@mi) { # failed to match all from @subseq
        $s = 0;  
        @mi = ();
    }
}
dd \@all_seqs;

The @mi is included in all tests after the first one so that they are done only when something has already been matched.

Prints

[[6 .. 9], [32, 33, 34]]

Uncomment printing lines to track its operation. This has been tested beyond the basic run above but not well enough.


Or, concatenate all words into a string and match the sub-sequence, concatenated into a pattern, by regex; then it's easy to take care of possible repetitions. In order to also scoop up indices from the original array in a match I prepend each word by __INDEX__.

# Same @words and @subseq from above

my $w = join '', map { '__'.$_.'__' . $words[$_] } 0.. $#words;

my $patt = '(' . 
    join('', map { '(?:' . '__[0-9] __' . quotemeta($_) . ') ' } @subseq) . ')';

my @seqs = $w =~ /$patt/g;

my @seqs_idx = map { [ /__([0-9] )__/g ]  } @seqs;

dd \@seqs_idx;
  •  Tags:  
  • perl
  • Related