I have thousands of strings arranged in the following manner:
>String1 \(sub1|string1)
DIMMFIOYBVSYBE
EFMWUISCFIUBFMCIUOEFMIUIDEM
>String2 \(sub2|string2)
HVUYVMUYBOIIUMYTVU
SYOYVSCOUYCVUYVSUYVUSC
I want to split each string into substrings and keep only those that meet certain conditions. For the example above, the candidate substrings are:
String1:
DIM X
M X
FIOYBVS
YBEEFM
WUISC X
FIUBFMC X
IUOEFM
IUIDEM
String2:
HVUYVM
UYBOIIUM
YTVUS
YOYVSC X
OUYCVUYVS X
UYVUSC X
The conditions are:
- The substring should end with either M or S.
- The M or S should not be followed by C.
- The length of the substring should be ≥ 4 and ≤ 8 characters.
In the list above, the substrings marked X are discarded because they do not meet these criteria.
Expected output:
Name Frequency substrings
String1 4 FIOYBVS; YBEEFM; IUOEFM; IUIDEM
String2 3 HVUYVM; UYBOIIUM; YTVUS
I tried using the sliding window method mentioned here. It does not work for me. Any help appreciated.
CodePudding user response:
The following requires two passes, but it satisfies your stated conditions:
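import re

for s in (
    'DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM',
    'HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC',
):
    # Pass 1: cut after every M or S, taking a trailing C along if present.
    # Pass 2: keep pieces of length 4 to 8 that do not end in that C.
    res = [w for w in re.findall('[^MS]*[MS]C?', s)
           if 4 <= len(w) <= 8 and not w.endswith('C')]
    print(res)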
Output:
['FIOYBVS', 'YBEEFM', 'IUOEFM', 'IUIDEM']
['HVUYVM', 'UYBOIIUM', 'YTVUS']
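To get from these lists to the table in the question, the records first have to be read and their sequence lines joined. The following is a minimal sketch of that step, not a definitive implementation: the helper names `read_records`, `valid_pieces` and `format_table` are made up for illustration, and it assumes every record starts with a `>` header whose first token is the name, followed by one or more sequence lines.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Same pattern as above: a run ending in M or S, plus an optional trailing C.
PIECE = re.compile(r'[^MS]*[MS]C?')


def read_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, joined sequence) for each '>'-headed record (hypothetical helper)."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(parts)
            # Assumption: the name is the first token after '>'.
            name, parts = (line[1:].split() or [''])[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, ''.join(parts)


def valid_pieces(seq: str) -> List[str]:
    """Keep pieces of length 4 to 8 whose M/S is not followed by C."""
    return [w for w in PIECE.findall(seq)
            if 4 <= len(w) <= 8 and not w.endswith('C')]


def format_table(lines: Iterable[str]) -> List[str]:
    """Build tab-separated rows: name, number of kept pieces, the pieces."""
    rows = ['Name\tFrequency\tsubstrings']
    for name, seq in read_records(lines):
        found = valid_pieces(seq)
        rows.append(f"{name}\t{len(found)}\t{'; '.join(found)}")
    return rows


sample = [
    r'>String1 \(sub1|string1)',
    'DIMMFIOYBVSYBE',
    'EFMWUISCFIUBFMCIUOEFMIUIDEM',
    r'>String2 \(sub2|string2)',
    'HVUYVMUYBOIIUMYTVU',
    'SYOYVSCOUYCVUYVSUYVUSC',
]
print('\n'.join(format_table(sample)))
```

For a real file, an open file handle can be passed to `format_table` in place of `sample`, since it only needs an iterable of lines.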
CodePudding user response:
str_1 = 'DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM'
str_2 = 'HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC'
strings = [str_1, str_2]

for s in strings:
    wrk = s
    print(s)
    print(40 * '-')
    while len(wrk) > 0:
        # 1-based positions of the first 'M' and 'S'; 0 means "not found"
        m_pos = wrk.find('M') + 1
        s_pos = wrk.find('S') + 1
        if m_pos > 0 and (s_pos == 0 or m_pos < s_pos):
            pos = m_pos
        elif s_pos > 0:
            pos = s_pos
        else:
            pos = len(wrk)  # no M or S left: take the rest so the loop ends
        elem = wrk[:pos]
        # A 'C' right after the M/S belongs to this piece and invalidates it
        if wrk[pos:pos + 1] == 'C':
            elem += 'C'
            pos += 1
        if 4 <= len(elem) <= 8 and elem[-1] in 'MS':
            print(elem)
        else:
            print(elem + (20 - len(elem)) * ' ' + 'X')
        wrk = wrk[pos:]
    print()
#
# R e s u l t
#
'''
DIMMFIOYBVSYBEEFMWUISCFIUBFMCIUOEFMIUIDEM
----------------------------------------
DIM                 X
M                   X
FIOYBVS
YBEEFM
WUISC               X
FIUBFMC             X
IUOEFM
IUIDEM
HVUYVMUYBOIIUMYTVUSYOYVSCOUYCVUYVSUYVUSC
----------------------------------------
HVUYVM
UYBOIIUM
YTVUS
YOYVSC              X
OUYCVUYVS           X
UYVUSC              X
'''
CodePudding user response:
Using three sed calls and one awk call, this may work, but it may very well prove slower than Python:
$ cat script1.sed
/String/! {
    /^[[:upper:]]+$/ {
        N;s/\n//
    }
    s/([^MS]*(M|S)C?)/&\n/g
}
$ cat script2.sed
/^>/! {
    /\<[[:upper:]]{4,8}\>/!d
    /(M|S)C/d
}
1iName\tFrequency\tsubstrings
$ cat script3.sed
s/>?([^\n]*)\n/\1 /g
s/String[0-9] /\n&\t/g
s/\\[^)]*\)//g
s/[[:upper:]]+\>/&;/g
Using the scripts, and adding an awk call to count the frequency, we can generate this:
sed -Ef script1.sed input_file | sed -Ef script2.sed | sed -Ezf script3.sed | awk '/;/{$2=NF-1"\t\t"$2}1'
Name Frequency substrings
String1 4 FIOYBVS; YBEEFM; IUOEFM; IUIDEM;
String2 3 HVUYVM; UYBOIIUM; YTVUS;