Splitting a mol2 text file for every nth occurrences of a string in bash and printing to subfiles-CodePudding

I am trying to separate a text file containing millions of molecules in mol2 format using bash. An example of the code below is shown. The repeating string is @<TRIPOS>MOLECULE so I would like to use this to write subfiles every nth occurrence of it. Roughly every 30,000th time. I have tried to write this script below but nothing writes to subfiles

@<TRIPOS>MOLECULE
ZINC000169688935
54    57    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      C1     -1.3065    0.9799    0.3206 C.3      1 ZINC000169688935   -0.1500
2      C2     -1.4308   -0.5375    0.4726 C.3      1 ZINC000169688935   -0.0500
...
@<TRIPOS>MOLECULE
ZINC000022925452
49    49    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000
...

#!/bin/bash

FILENAME="450mw_comp.mol2"
mol_counter="0"
file_counter="0"
new_file="$FILENAME.$file_counter"

cat $FILENAME
IFS=' '
while read -r line;
do
printf "$line" >> $new_file
if "$line" == *"@<TRIPOS>MOLECULE"*
then
     mol_counter=$((mol_counter 1))

if [ $mol_counter -eq 100 ]
then
    file_counter=$((file_counter 1))
    mol_counter=0
fi

done

CodePudding user response：

As mentioned in the comments parsing millions of lines in a bash loop is not recommended as it will be extremely slow.

NOTE: see 2nd half of answer for code to create sample input file

One awk idea:

fname='450mw_comp.mol2'
n=2                                                  # limit each output file to 2 molecules

awk -v n="${n}" '

/^@<TRIPOS>MOLECULE/ { molcount                      # pattern match? increment molecule counter and ...
                       if (molcount % n == 1) {      # if counter modulo n == 0 then ...
                          close(outfile)             # close current output file and ...
                          sfx=sprintf("d",  c)    # create new suffix (0-padded 5-digit #) and ...
                          outfile=FILENAME "." sfx   # create new output file name
                       }
                     }
                     { print $0 > outfile }
' "${fname}"

NOTES:

remove comments to declutter code
change bash variable n as needed (eg, n=30000)
adjust output file name suffix width () as needed

Verification:

$ fname='450mw_comp.mol2'

$ ls -l "${fname}"*
-rw-rw----  1 username None 2794 Feb 17 09:28 450mw_comp.mol2
-rw-rw----  1 username None  508 Feb 17 09:28 450mw_comp.mol2.00001
-rw-rw----  1 username None  508 Feb 17 09:28 450mw_comp.mol2.00002
-rw-rw----  1 username None  508 Feb 17 09:28 450mw_comp.mol2.00003
-rw-rw----  1 username None  508 Feb 17 09:28 450mw_comp.mol2.00004
-rw-rw----  1 username None  508 Feb 17 09:28 450mw_comp.mol2.00005
-rw-rw----  1 username None  254 Feb 17 09:28 450mw_comp.mol2.00006

$ cat "${fname}".00001
@<TRIPOS>MOLECULE
ZINC000022XXX001
49    49    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000
@<TRIPOS>MOLECULE
ZINC000022XXX002
49    49    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000

$ for f in "${fname}".*
do
    printf "\n###### ${f}\n\n"
    grep '^ZINC' "${f}"
done

###### 450mw_comp.mol2.00001

ZINC000022XXX001
ZINC000022XXX002

###### 450mw_comp.mol2.00002

ZINC000022XXX003
ZINC000022XXX004

###### 450mw_comp.mol2.00003

ZINC000022XXX005
ZINC000022XXX006

###### 450mw_comp.mol2.00004

ZINC000022XXX007
ZINC000022XXX008

###### 450mw_comp.mol2.00005

ZINC000022XXX009
ZINC000022XXX010

###### 450mw_comp.mol2.00006

ZINC000022XXX011

For demonstration purposes we'll generate an input file with 11 sample molecules; we'll terminate each ^ZINC... string with XXX plus a 0-padded 3-digit counter (we won't worry about updating the ATOM entries with matching ZINC... strings):

fname='450mw_comp.mol2'

awk '
BEGIN {
for (i=1;i<=11;i  )

printf "@<TRIPOS>MOLECULE\n\
ZINC000022XXXd\n\
49    49    1     0     0\n\
SMALL\n\
USER_CHARGES\n\
@<TRIPOS>ATOM\n\
1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400\n\
2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000\n",i
}
' > "${fname}"

Verification:

$ head "${fname}"
ZINC000022XXX001
49    49    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000
@<TRIPOS>MOLECULE
ZINC000022XXX002

$ grep '^ZINC' "${fname}"
ZINC000022XXX001
ZINC000022XXX002
ZINC000022XXX003
ZINC000022XXX004
ZINC000022XXX005
ZINC000022XXX006
ZINC000022XXX007
ZINC000022XXX008
ZINC000022XXX009
ZINC000022XXX010
ZINC000022XXX011

CodePudding user response：

Using awk

$ awk -F">" 'BEGIN{add=0}/^@[^M]*MOLECULE/ {close(mol2); mol2=$2 add  } {print > mol2}' input_file

Output

$ cat MOLECULE0
@<TRIPOS>MOLECULE
ZINC000169688935
54    57    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      C1     -1.3065    0.9799    0.3206 C.3      1 ZINC000169688935   -0.1500
2      C2     -1.4308   -0.5375    0.4726 C.3      1 ZINC000169688935   -0.0500
...


$ cat MOLECULE1
@<TRIPOS>MOLECULE
ZINC000022925452
49    49    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000
...