How to extract substrings based on multiple criteria?

Time:07-30

I have thousands of strings arranged in the following manner:

>String1 \(sub1|string1)
DIMMFIOYBVSYBE
EFMWUISCFIUBFMCIUOEFMIUIDEM

>String2 \(sub2|string2)
HVUYVMUYBOIIUMYTVU
SYOYVSCOUYCVUYVSUYVUSC

I want to split each string into substrings and keep only those that meet certain conditions. For the two strings above, the candidate substrings are:

String1:

DIM        X
M          X
FIOYBVS
YBEEFM
WUISC      X
FIUBFMC    X
IUOEFM
IUIDEM

String2:

HVUYVM
UYBOIIUM
YTVUS
YOYVSC      X
OUYCVUYVS   X
UYVUSC      X

The conditions are:

  1. The substring must end with either M or S.
  2. That final M or S must not be followed by C.
  3. The substring must be at least 4 and at most 8 characters long.

The substrings marked X in the lists above are rejected because they do not meet these criteria.
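
As a minimal sketch, the three conditions can be written as a single predicate. The function name `is_valid` and the `next_char` parameter are made up for illustration; `next_char` is assumed to be the character that follows the piece in the full sequence (the lists above show a following C attached to the piece instead, e.g. WUISC).

```python
def is_valid(piece: str, next_char: str = "") -> bool:
    """Return True when `piece` meets the three stated conditions.

    `next_char` is the character right after `piece` in the full
    sequence, or "" when `piece` ends the sequence (an assumption
    made for this sketch).
    """
    ends_ok = piece.endswith(("M", "S"))   # condition 1: ends with M or S
    not_before_c = next_char != "C"        # condition 2: not followed by C
    length_ok = 4 <= len(piece) <= 8       # condition 3: 4 to 8 characters
    return ends_ok and not_before_c and length_ok


print(is_valid("FIOYBVS", "Y"))    # kept in the question
print(is_valid("WUIS", "C"))       # rejected: the S is followed by C
print(is_valid("OUYCVUYVS", "U"))  # rejected: 9 characters
```

Passing the following character separately keeps the length check on the piece itself, so a piece such as UYBOIIUM (8 characters) is still accepted.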

Expected output:

Name    Frequency    substrings
String1   4          FIOYBVS; YBEEFM; IUOEFM; IUIDEM
String2   3          HVUYVM; UYBOIIUM; YTVUS

I tried using the sliding window method mentioned here. It does not work for me. Any help appreciated.

CodePudding user response:

The following requires two passes, but it satisfies your stated conditions:

import re

for s in (
    'DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM',
    'HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC',
    ):
    res = [w for w in re.findall('[^MS]*[MS]C?', s)
           if 4 <= len(w) <= 8 and not w.endswith('C')]
    print(res)

Output:

['FIOYBVS', 'YBEEFM', 'IUOEFM', 'IUIDEM']
['HVUYVM', 'UYBOIIUM', 'YTVUS']
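
The snippet above works on sequences that are already joined into one string. A possible way to go from the input format in the question to the requested table is sketched below. It assumes every record starts with a `>` header whose first word is the name, and that the sequence lines of a record are simply concatenated. The helper names `read_records`, `extract` and `report` are made up for this sketch.

```python
import re

# Same pattern as in the answer: a run of non-terminators, then M or S,
# then an optional C that marks the piece for rejection.
PIECE = re.compile(r"[^MS]*[MS]C?")


def read_records(lines):
    """Map each '>' header name to its concatenated sequence lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word after '>'
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def extract(seq):
    """Keep pieces of length 4..8 that do not carry a trailing C."""
    return [w for w in PIECE.findall(seq)
            if 4 <= len(w) <= 8 and not w.endswith("C")]


def report(lines):
    """Build tab-separated rows: name, frequency, substrings."""
    rows = ["Name\tFrequency\tsubstrings"]
    for name, seq in read_records(lines).items():
        found = extract(seq)
        rows.append(f"{name}\t{len(found)}\t{'; '.join(found)}")
    return rows


sample = [
    ">String1 \\(sub1|string1)",
    "DIMMFIOYBVSYBE",
    "EFMWUISCFIUBFMCIUOEFMIUIDEM",
    "",
    ">String2 \\(sub2|string2)",
    "HVUYVMUYBOIIUMYTVU",
    "SYOYVSCOUYCVUYVSUYVUSC",
]

for row in report(sample):
    print(row)
```

For a real file, `report(open(path))` should work the same way, since the function only iterates over lines; `path` here stands for wherever the input is stored.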

CodePudding user response:

str_1 = 'DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM'
str_2 = 'HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC'

strings = [str_1, str_2]
 
for s in strings:
    wrk = s
    print(s)
    print(40 * '-')
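    while len(wrk) > 0:
        # 1-based end positions of the next M and S; 0 means "not found"
        m = wrk.find('M') + 1
        s = wrk.find('S') + 1

        # cut at whichever terminator comes first
        if (m < s and m > 0) or s == 0:
            pos = m
        else:
            pos = s

        if pos == 0:
            # neither M nor S is left, so the remainder cannot qualify
            break

        elem = wrk[ : pos]
        # keep a following C with the piece so that it is rejected below
        if wrk[pos : pos + 1] == 'C':
            elem += 'C'
            pos += 1

        if len(elem) >= 4 and len(elem) <= 8 and elem[-1:] != 'C':
            print(elem)
        else:
            print(elem + (20-len(elem)) * ' ' + 'X')

        wrk = wrk[pos : ]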
    print()
#
#   R e s u l t 
#
'''
DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM
----------------------------------------
DIM                 X
M                   X
FIOYBVS
YBEEFM
WUISC               X
FIUBFMC             X
IUOEFM
IUIDEM

HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC
----------------------------------------
HVUYVM
UYBOIIUM
YTVUS
YOYVSC              X
OUYCVUYVS           X
UYVUSC              X
'''

CodePudding user response:

Using three sed calls and one awk call, this may work, but it may well prove slower than Python.

$ cat script1.sed
/String/! {
    /^[[:upper:]]+$/ {
        N;s/\n//
    }
    s/([^MS]*(M|S)C?)/&\n/g
}
$ cat script2.sed
/^>/! { 
    /\<[[:upper:]]{4,8}\>/!d
    /(M|S)C/d
} 
1iName\tFrequency\tsubstrings
$ cat script3.sed
s/>?([^\n]*)\n/\1 /g
s/String[0-9]+/\n&\t/g
s/\\[^)]*\)//g
s/[[:upper:]]+\>/&;/g

Using the scripts, and adding an awk call to count the frequency, we can generate this:

sed -Ef script1.sed input_file | sed -Ef script2.sed | sed -Ezf script3.sed | awk  '/;/{$2=NF-1"\t\t"$2}1'
Name    Frequency   substrings 
String1 4       FIOYBVS; YBEEFM; IUOEFM; IUIDEM;
String2 3       HVUYVM; UYBOIIUM; YTVUS;