I have a file that contains 10,000 molecules. Each molecule ends with the keyword $$$$. I want to split the main file into 10,000 separate files so that each file contains exactly one molecule. Each molecule has a different number of lines. I tried sed on test_file.txt as follows:
sed '/$$$$/q' test_file.txt > out.txt
input:
$ cat test_file.txt
ashu
vishu
jyoti
$$$$
Jatin
Vishal
Shivani
$$$$
output:
$ cat out.txt
ashu
vishu
jyoti
$$$$
I can loop this over the whole main file to create 10,000 separate files, but how do I delete from the main file the molecule that was just moved to a new file? Or please suggest a better method, which I believe exists. Thanks.
Edit1:
$ cat short_library.sdf
untitled.cdx
csChFnd80/09142214492D
31 34 0 0 0 0 0 0 0 0999 V2000
8.4660 6.2927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4660 4.8927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2124 2.0951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4249 2.7951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
30 31 1 0 0 0 0
31 26 1 0 0 0 0
M END
> <Mol_ID> (1)
1
> <Formula> (1)
C22H24ClFN4O3
> <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html
$$$$
Dimesna.cdx
csChFnd80/09142214492D
16 13 0 0 0 0 0 0 0 0999 V2000
2.4249 1.4000 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
3.6415 2.1024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8540 1.4024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4904 1.7512 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
1 14 2 0 0 0 0
M END
> <Mol_ID> (2)
2
> <Formula> (2)
C4H8Na2O6S4
> <URL> (2)
http://www.selleckchem.com/products/Dimesna.html
$$$$
CodePudding user response:
Here's a simple solution with standard awk:
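LANG=C awk '
    # Append the current line to the molecule being collected.
    { mol = (mol == "" ? $0 : mol "\n" $0) }
    # A $$$$ line (optionally followed by CR) ends the molecule: write it out.
    /^\$\$\$\$\r?$/ {
        outFile = "molecule" (++fn) ".sdf"
        print mol > outFile
        close(outFile)
        mol = ""
    }
' input.sdf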
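As a quick sanity check for this approach, the number of delimiter lines can be compared with the number of files produced. The following is only a sketch: it recreates the question's sample in a scratch directory, and it assumes the file counter is incremented with `++fn` and that the `molecule` prefix is free to use.

```shell
#!/bin/sh
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the sample input from the question.
printf '%s\n' ashu vishu jyoti '$$$$' Jatin Vishal Shivani '$$$$' > test_file.txt

# Split: collect lines, and write them out whenever a $$$$ line is seen.
# The ++fn counter is an assumption about how the files are numbered.
LC_ALL=C awk '
    { mol = (mol == "" ? $0 : mol "\n" $0) }
    /^\$\$\$\$\r?$/ {
        outFile = "molecule" (++fn) ".sdf"
        print mol > outFile
        close(outFile)
        mol = ""
    }
' test_file.txt

# Compare the number of delimiter lines with the number of output files.
expected=$(grep -c '^\$\$\$\$' test_file.txt)
set -- molecule*.sdf
actual=$#
echo "delimiters: $expected, files: $actual"
```

If the two numbers differ on a real library, the delimiter lines probably do not look the way the pattern expects, for example because of trailing characters.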
CodePudding user response:
If you have csplit from GNU coreutils:
csplit -s -z -n5 -fmolecule test_file.txt '/^$$$$$/+1' '{*}'
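To see what the csplit call produces, it can be tried on the question's small sample first. This is a sketch under two assumptions: GNU csplit is installed (the `-z` option is a GNU extension), and the split point is written with a `+1` offset so that each `$$$$` line stays with its molecule.

```shell
#!/bin/sh
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the sample input from the question.
printf '%s\n' ashu vishu jyoti '$$$$' Jatin Vishal Shivani '$$$$' > test_file.txt

# -s: quiet, -z: drop empty pieces, -n5: five-digit suffix, -f: output prefix.
# '/^$$$$$/+1' splits just after each line that is exactly $$$$,
# and '{*}' repeats the pattern until the end of the input.
csplit -s -z -n5 -fmolecule test_file.txt '/^$$$$$/+1' '{*}'

# List the pieces; csplit numbers them starting from 00000.
ls molecule*
```

Note that the output names carry no extension here; they can be renamed afterwards if an .sdf suffix is needed.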
CodePudding user response:
This will do the whole job directly in bash:
molsplit.sh
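#!/bin/bash
filenum=0
end=1
# IFS= keeps leading whitespace, which column-aligned SDF lines may rely on.
while IFS= read -r line; do
    if [[ $end -eq 1 ]]; then
        # Start of a new molecule: open the next output file on descriptor 3.
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    if [[ "$line" = '$$$$' ]]; then
        # End of the molecule: close the output file.
        end=1
        exec 3>&-
    fi
done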
Input is read from stdin, though that would be easy enough to change. Something like this:
./molsplit.sh < test_file.txt
ADDENDUM
From subsequent commentary, it seems that the input file being processed has Windows line endings, whereas the processing environment's native line ending format is UNIX-style. In that case, if the line-termination style is to be preserved, then we need to modify how the delimiters are recognized. For example, this variation on the above will recognize any line that starts with $$$$ as a molecule delimiter:
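#!/bin/bash
filenum=0
end=1
# IFS= keeps leading whitespace, which column-aligned SDF lines may rely on.
while IFS= read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    # Match on the $$$$ prefix so a trailing carriage return does not matter.
    case $line in
        '$$$$'*) end=1; exec 3>&-;;
    esac
done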
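The difference between the two delimiter tests can be shown in isolation. The sketch below builds a delimiter line the way it would arrive from a file with Windows line endings, that is, with a trailing carriage return, and applies both tests to it; the variable names are illustrative only.

```shell
#!/bin/sh
# A delimiter line as read from a CRLF file: the trailing CR stays in the variable.
line=$(printf '$$$$\r')

# Exact comparison, as in the first script: fails because of the CR.
if [ "$line" = '$$$$' ]; then exact=match; else exact=nomatch; fi

# Prefix pattern, as in the variation: still matches.
case $line in
    '$$$$'*) prefix=match ;;
    *)       prefix=nomatch ;;
esac

echo "exact: $exact, prefix: $prefix"
# prints "exact: nomatch, prefix: match"
```

With the exact comparison the end of a molecule is never detected in such a file, so everything would end up in the first output file.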
CodePudding user response:
You can do this with one command line using rquery (https://github.com/fuyuncat/rquery/releases).
The idea is to use a dynamic variable that increases by 1 whenever a line matches '$$$$'. Using that variable as part of the output file name splits the content.
[ rquery]$ ./rq -v "v1:0:tolong(@v1)+switch(@raw,'\$\$\$\$',1,0)" -q "s @raw |>'/tmp/test'+@v1+'.txt'" samples/short_library.sdf
[ rquery]$ cat /tmp/test0.txt
untitled.cdx
csChFnd80/09142214492D
31 34 0 0 0 0 0 0 0 0999 V2000
8.4660 6.2927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4660 4.8927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2124 2.0951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4249 2.7951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
30 31 1 0 0 0 0
31 26 1 0 0 0 0
M END
> <Mol_ID> (1)
1
> <Formula> (1)
C22H24ClFN4O3
> <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html
$$$$
[ rquery]$ cat /tmp/test1.txt
Dimesna.cdx
csChFnd80/09142214492D
16 13 0 0 0 0 0 0 0 0999 V2000
2.4249 1.4000 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
3.6415 2.1024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8540 1.4024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4904 1.7512 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
1 14 2 0 0 0 0
M END
> <Mol_ID> (2)
2
> <Formula> (2)
C4H8Na2O6S4
> <URL> (2)
http://www.selleckchem.com/products/Dimesna.html
$$$$
CodePudding user response:
The same statement that sets the current output file name also closes the previous one. close(_)^_ here is the same as close(_)^0, which ensures the file number always increments for the next one, even if the close() action resulted in an error.
If the output file naming scheme allows for leading zeros, then change that bit to close(_)^(_<_), which ALWAYS results in a 1, for any possible string or number, including all forms of zero, the empty string, infinities, and NaNs.
mawk2 'BEGIN { getline __<(_ = "/dev/null") ORS = RS = "[$][$][$][$][\r]?"(FS = RS) __*= gsub("[^$\n]+", __, ORS) } NF { print > (_ ="mol" (__ +=close(_)^_) ".txt") }' test_file.txt
The first part, the getline from /dev/null, neither sets $0 or NF nor modifies NR or FNR, but its existence ensures that the first call to close(_) does not error out.
gcat -n mol12345.txt
1 Shivani
2 jyoti
3 Shivani
4 $$$$
It was reasonably speedy: from a 5.60 MB synthetic test file it created 187,710 files in 11.652 secs.