I am trying to separate a text file containing millions of molecules in mol2 format using bash. An example of the code below is shown. The repeating string is @<TRIPOS>MOLECULE
so I would like to use this to write subfiles every nth occurrence of it. Roughly every 30,000th time. I have tried to write this script below but nothing writes to subfiles
@<TRIPOS>MOLECULE
ZINC000169688935
54 57 1 0 0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 C1 -1.3065 0.9799 0.3206 C.3 1 ZINC000169688935 -0.1500
2 C2 -1.4308 -0.5375 0.4726 C.3 1 ZINC000169688935 -0.0500
...
@<TRIPOS>MOLECULE
ZINC000022925452
49 49 1 0 0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 N1 -3.9528 1.8161 -2.8615 N.am 1 ZINC000022925452 -1.2400
2 S2 -2.7113 2.7835 -3.3766 S.o2 1 ZINC000022925452 2.7000
...
#!/bin/bash
FILENAME="450mw_comp.mol2"
mol_counter="0"
file_counter="0"
new_file="$FILENAME.$file_counter"
cat $FILENAME
IFS=' '
while read -r line;
do
printf "$line" >> $new_file
if "$line" == *"@<TRIPOS>MOLECULE"*
then
mol_counter=$((mol_counter 1))
if [ $mol_counter -eq 100 ]
then
file_counter=$((file_counter 1))
mol_counter=0
fi
done
CodePudding user response:
As mentioned in the comments parsing millions of lines in a bash
loop is not recommended as it will be extremely slow.
NOTE: see 2nd half of answer for code to create sample input file
One awk
idea:
fname='450mw_comp.mol2'
n=2 # limit each output file to 2 molecules
awk -v n="${n}" '
/^@<TRIPOS>MOLECULE/ { molcount # pattern match? increment molecule counter and ...
if (molcount % n == 1) { # if counter modulo n == 0 then ...
close(outfile) # close current output file and ...
sfx=sprintf("d", c) # create new suffix (0-padded 5-digit #) and ...
outfile=FILENAME "." sfx # create new output file name
}
}
{ print $0 > outfile }
' "${fname}"
NOTES:
- remove comments to declutter code
- change
bash
variablen
as needed (eg,n=30000
) - adjust output file name suffix width (
) as needed
Verification:
$ fname='450mw_comp.mol2'
$ ls -l "${fname}"*
-rw-rw---- 1 username None 2794 Feb 17 09:28 450mw_comp.mol2
-rw-rw---- 1 username None 508 Feb 17 09:28 450mw_comp.mol2.00001
-rw-rw---- 1 username None 508 Feb 17 09:28 450mw_comp.mol2.00002
-rw-rw---- 1 username None 508 Feb 17 09:28 450mw_comp.mol2.00003
-rw-rw---- 1 username None 508 Feb 17 09:28 450mw_comp.mol2.00004
-rw-rw---- 1 username None 508 Feb 17 09:28 450mw_comp.mol2.00005
-rw-rw---- 1 username None 254 Feb 17 09:28 450mw_comp.mol2.00006
$ cat "${fname}".00001
@<TRIPOS>MOLECULE
ZINC000022XXX001
49 49 1 0 0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 N1 -3.9528 1.8161 -2.8615 N.am 1 ZINC000022925452 -1.2400
2 S2 -2.7113 2.7835 -3.3766 S.o2 1 ZINC000022925452 2.7000
@<TRIPOS>MOLECULE
ZINC000022XXX002
49 49 1 0 0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 N1 -3.9528 1.8161 -2.8615 N.am 1 ZINC000022925452 -1.2400
2 S2 -2.7113 2.7835 -3.3766 S.o2 1 ZINC000022925452 2.7000
$ for f in "${fname}".*
do
printf "\n###### ${f}\n\n"
grep '^ZINC' "${f}"
done
###### 450mw_comp.mol2.00001
ZINC000022XXX001
ZINC000022XXX002
###### 450mw_comp.mol2.00002
ZINC000022XXX003
ZINC000022XXX004
###### 450mw_comp.mol2.00003
ZINC000022XXX005
ZINC000022XXX006
###### 450mw_comp.mol2.00004
ZINC000022XXX007
ZINC000022XXX008
###### 450mw_comp.mol2.00005
ZINC000022XXX009
ZINC000022XXX010
###### 450mw_comp.mol2.00006
ZINC000022XXX011
For demonstration purposes we'll generate an input file with 11 sample molecules; we'll terminate each ^ZINC...
string with XXX
plus a 0-padded 3-digit counter (we won't worry about updating the ATOM
entries with matching ZINC...
strings):
fname='450mw_comp.mol2'
awk '
BEGIN {
for (i=1;i<=11;i )
printf "@<TRIPOS>MOLECULE\n\
ZINC000022XXXd\n\
49 49 1 0 0\n\
SMALL\n\
USER_CHARGES\n\
@<TRIPOS>ATOM\n\
1 N1 -3.9528 1.8161 -2.8615 N.am 1 ZINC000022925452 -1.2400\n\
2 S2 -2.7113 2.7835 -3.3766 S.o2 1 ZINC000022925452 2.7000\n",i
}
' > "${fname}"
Verification:
$ head "${fname}"
ZINC000022XXX001
49 49 1 0 0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 N1 -3.9528 1.8161 -2.8615 N.am 1 ZINC000022925452 -1.2400
2 S2 -2.7113 2.7835 -3.3766 S.o2 1 ZINC000022925452 2.7000
@<TRIPOS>MOLECULE
ZINC000022XXX002
$ grep '^ZINC' "${fname}"
ZINC000022XXX001
ZINC000022XXX002
ZINC000022XXX003
ZINC000022XXX004
ZINC000022XXX005
ZINC000022XXX006
ZINC000022XXX007
ZINC000022XXX008
ZINC000022XXX009
ZINC000022XXX010
ZINC000022XXX011
CodePudding user response:
Using awk
$ awk -F">" 'BEGIN{add=0}/^@[^M]*MOLECULE/ {close(mol2); mol2=$2 add } {print > mol2}' input_file
Output
$ cat MOLECULE0
@<TRIPOS>MOLECULE
ZINC000169688935
54 57 1 0 0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 C1 -1.3065 0.9799 0.3206 C.3 1 ZINC000169688935 -0.1500
2 C2 -1.4308 -0.5375 0.4726 C.3 1 ZINC000169688935 -0.0500
...
$ cat MOLECULE1
@<TRIPOS>MOLECULE
ZINC000022925452
49 49 1 0 0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 N1 -3.9528 1.8161 -2.8615 N.am 1 ZINC000022925452 -1.2400
2 S2 -2.7113 2.7835 -3.3766 S.o2 1 ZINC000022925452 2.7000
...