How to extract substrings based on multiple criteria?

I have thousands of strings arranged in the following manner:

>String1 \(sub1|string1)
DIMMFIOYBVSYBE
EFMWUISCFIUBFMCIUOEFMIUIDEM

>String2 \(sub2|string2)
HVUYVMUYBOIIUMYTVU
SYOYVSCOUYCVUYVSUYVUSC

I want to split each string into substrings and keep only those that meet certain conditions. For the two strings above, the candidate substrings are:

String1:

DIM        X
M          X
FIOYBVS
YBEEFM
WUISC      X
FIUBFMC    X
IUOEFM
IUIDEM

String2:

HVUYVM
UYBOIIUM
YTVUS
YOYVSC      X
OUYCVUYVS   X
UYVUSC      X

The conditions are:

The substring should end with either M or S.
The M or S should not be followed by C.
The length of the substring should be ≥ 4 and ≤ 8 characters.

In the example lists above, the substrings marked X are not considered because they do not meet the criteria.

Expected output:

Name    Frequency    substrings
String1   4          FIOYBVS; YBEEFM; IUOEFM; IUIDEM
String2   3          HVUYVM; UYBOIIUM; YTVUS

I tried using the sliding window method mentioned here. It does not work for me. Any help appreciated.

CodePudding user response：

The following requires two passes, but it satisfies your stated conditions:

import re

for s in (
    'DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM',
    'HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC',
    ):
    res = [w for w in re.findall('[^MS]*[MS]C?', s)
           if 4 <= len(w) <= 8 and not w.endswith('C')]
    print(res)

Output:

['FIOYBVS', 'YBEEFM', 'IUOEFM', 'IUIDEM']
['HVUYVM', 'UYBOIIUM', 'YTVUS']
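
To get from these lists to the table requested in the question, the same split-then-filter idea could be wrapped in a small parser. This is only a sketch: it assumes the input looks exactly like the sample in the question (a `>` header line followed by one or more sequence lines), and the helper names `read_records` and `extract` are made up for illustration.

```python
import re
from typing import Dict, Iterable, List


def read_records(lines: Iterable[str]) -> Dict[str, str]:
    """Join the sequence lines that follow each '>' header into one string."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # '>String1 \(sub1|string1)' -> 'String1' (assumes a name follows '>')
            name = line[1:].split()[0]
            records[name] = ''
        elif name is not None:
            records[name] += line
    return records


def extract(seq: str) -> List[str]:
    """Split after each M or S (plus a directly following C), then filter."""
    return [w for w in re.findall(r'[^MS]*[MS]C?', seq)
            if 4 <= len(w) <= 8 and not w.endswith('C')]


# Sample data copied from the question; a real run would read a file instead.
sample = r"""
>String1 \(sub1|string1)
DIMMFIOYBVSYBE
EFMWUISCFIUBFMCIUOEFMIUIDEM

>String2 \(sub2|string2)
HVUYVMUYBOIIUMYTVU
SYOYVSCOUYCVUYVSUYVUSC
"""

rows = [(name, extract(seq))
        for name, seq in read_records(sample.splitlines()).items()]

print('Name\tFrequency\tsubstrings')
for name, subs in rows:
    print(f"{name}\t{len(subs)}\t{'; '.join(subs)}")
```

Because every regex match must contain an M or S, any trailing characters after the last M or S are silently dropped, which seems consistent with the first condition.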

CodePudding user response：

str_1 = 'DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM'
str_2 = 'HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC'

strings = [str_1, str_2]
 
for s in strings:
    wrk = s
    print(s)
    print(40 * '-')
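    while len(wrk) > 0:
        # 1-based end positions of the next M and S (0 means "not found")
        m_end = wrk.find('M') + 1
        s_end = wrk.find('S') + 1

        # cut after whichever of M or S comes first
        if m_end and (not s_end or m_end < s_end):
            pos = m_end
        else:
            pos = s_end
        if pos == 0:
            break  # no M or S left; the remaining tail cannot qualify

        elem = wrk[:pos]
        # a C directly after M or S is attached so the chunk can be rejected
        if wrk[pos : pos + 1] == 'C':
            elem += 'C'
            pos += 1

        if 4 <= len(elem) <= 8 and elem[-1:] != 'C':
            print(elem)
        else:
            print(elem + (20 - len(elem)) * ' ' + 'X')

        wrk = wrk[pos:]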
    print()
#
#   R e s u l t 
#
'''
DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM
----------------------------------------
DIM                 X
M                   X
FIOYBVS
YBEEFM
WUISC               X
FIUBFMC             X
IUOEFM
IUIDEM

HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC
----------------------------------------
HVUYVM
UYBOIIUM
YTVUS
YOYVSC              X
OUYCVUYVS           X
UYVUSC              X
'''
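
The loop above can also be written as a reusable generator that walks the string once, character by character. The following is a sketch rather than a drop-in replacement: the name `scan` is hypothetical, and ignoring any tail after the last M or S is an assumption based on the first condition in the question.

```python
from typing import Iterator, Tuple


def scan(seq: str) -> Iterator[Tuple[str, bool]]:
    """Yield (chunk, keep) pairs.

    A chunk ends right after an M or S; a C directly following that
    M or S is attached to the chunk, which then gets rejected.
    """
    start = 0
    i = 0
    while i < len(seq):
        if seq[i] in 'MS':
            end = i + 1
            followed_by_c = seq[end:end + 1] == 'C'
            if followed_by_c:
                end += 1
            chunk = seq[start:end]
            keep = 4 <= len(chunk) <= 8 and not followed_by_c
            yield chunk, keep
            start = i = end
        else:
            i += 1
    # Assumption: characters after the last M or S are ignored.


for chunk, keep in scan('DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM'):
    # Rejected chunks are padded to 20 columns and marked with X.
    print(chunk if keep else f'{chunk:<20}X')
```

Collecting only the kept chunks is then a one-liner, for example `[c for c, keep in scan(s) if keep]`.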

CodePudding user response：

Using three sed calls and one awk call, this may work, but it may well prove slower than Python:

$ cat script1.sed
/String/! {
    /^[[:upper:]]+$/ {
        N;s/\n//
    }
    s/([^MS]*(M|S)C?)/&\n/g
}

$ cat script2.sed
/^>/! { 
    /\<[[:upper:]]{4,8}\>/!d
    /(M|S)C/d
} 
1iName\tFrequency\tsubstrings

$ cat script3.sed
s/>?([^\n]*)\n/\1 /g
s/String[0-9] /\n&\t/g
s/\\[^)]*\)//g
s/[[:upper:]]+\>/&;/g

Using these scripts, and adding an awk call to count the frequency, we can generate this:

sed -Ef script1.sed input_file | sed -Ef script2.sed | sed -Ezf script3.sed | awk  '/;/{$2=NF-1"\t\t"$2}1'
Name    Frequency   substrings 
String1 4       FIOYBVS; YBEEFM; IUOEFM; IUIDEM;
String2 3       HVUYVM; UYBOIIUM; YTVUS;