Splitting a large file containing multiple molecules


I have a file that contains 10,000 molecules. Each molecule ends with the keyword $$$$. I want to split the main file into 10,000 separate files so that each file contains only one molecule. Each molecule has a different number of lines. I have tried sed on test_file.txt as:

sed '/$$$$/q' test_file.txt > out.txt

input:

$ cat test_file.txt
ashu
vishu
jyoti
$$$$
Jatin
Vishal
Shivani
$$$$  

output:

$ cat out.txt
ashu
vishu
jyoti
$$$$

I can loop this over the whole main file to create 10,000 separate files, but how do I then delete the molecule that was just copied to a new file from the main file? Or please suggest a better method, which I believe exists. Thanks.
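
For reference, a minimal sketch of that extract-then-delete loop (assuming GNU sed for the in-place -i edit; it works on a copy, work.txt, and rewrites the file on every pass, so it will be slow for 10,000 molecules):

cp test_file.txt work.txt
i=0
while [ -s work.txt ]; do
    i=$((i + 1))
    sed '/^\$\$\$\$/q' work.txt > "molecule${i}.txt"   # copy up to and including the first $$$$
    sed -i '1,/^\$\$\$\$/d' work.txt                   # then delete that molecule from the working copy
done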

Edit1:

$ cat short_library.sdf

untitled.cdx
csChFnd80/09142214492D

 31 34  0  0  0  0  0  0  0  0999 V2000
    8.4660    6.2927    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.4660    4.8927    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2124    2.0951    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4249    2.7951    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
 30 31  1  0  0  0  0
 31 26  1  0  0  0  0
M  END
>  <Mol_ID> (1)
1

>  <Formula> (1)
C22H24ClFN4O3

>  <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html

$$$$
Dimesna.cdx
csChFnd80/09142214492D

 16 13  0  0  0  0  0  0  0  0999 V2000
    2.4249    1.4000    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    3.6415    2.1024    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8540    1.4024    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4904    1.7512    0.0000 Na  0  3  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  1 14  2  0  0  0  0
M  END
>  <Mol_ID> (2)
2

>  <Formula> (2)
C4H8Na2O6S4


>  <URL> (2)
http://www.selleckchem.com/products/Dimesna.html

$$$$

CodePudding user response:

Here's a simple solution with standard awk:

LANG=C awk '
    { mol = (mol == "" ? $0 : mol "\n" $0) }
    /^\$\$\$\$\r?$/ {
        outFile = "molecule" ++fn ".sdf"
        print mol > outFile
        close(outFile) 
        mol = ""
    }
' input.sdf
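
A variant of the same idea (not from the original answer): GNU awk supports a regex record separator, so each molecule can be read as a single record. A minimal sketch, assuming every delimiter line is exactly $$$$, optionally followed by a carriage return:

gawk -v RS='[$][$][$][$]\r?\n' '
    NF {
        out = "molecule" ++n ".sdf"
        printf "%s$$$$\n", $0 > out   # RS swallows the delimiter, so append it again
        close(out)                    # avoid running out of file descriptors on 10,000 files
    }
' input.sdf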

CodePudding user response:

If you have csplit from GNU coreutils:

csplit -s -z -n5 -fmolecule test_file.txt '/^\$\$\$\$$/+1' '{*}'
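
For the two-molecule test_file.txt shown above, this leaves the input untouched and should produce something like:

$ ls molecule*
molecule00000  molecule00001

Here -s suppresses the byte-count report, -z elides empty output files, -n5 pads the numeric suffix to five digits (enough for 10,000 molecules), and -fmolecule sets the output filename prefix.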

CodePudding user response:

This will do the whole job directly in bash:

molsplit.sh

#!/bin/bash

filenum=0
end=1

while read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    if [[ "$line" = '$$$$' ]]; then
        end=1
        exec 3>&-
    fi
done

Input is read from stdin, though that would be easy enough to change. Something like this:

./molsplit.sh < test_file.txt

ADDENDUM

From subsequent commentary, it seems that the input file being processed has Windows line endings, whereas the processing environment's native line ending format is UNIX-style. In that case, if the line-termination style is to be preserved then we need to modify how the delimiters are recognized. For example, this variation on the above will recognize any line that starts with $$$$ as a molecule delimiter:

#!/bin/bash

filenum=0
end=1

while read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    case $line in
      '$$$$'*) end=1; exec 3>&-;;
    esac
done
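
Alternatively, if converting the output to UNIX line endings is acceptable rather than preserving the CRLF endings, a variant sketch that strips the trailing carriage return and keeps the exact-match test:

#!/bin/bash

filenum=0
end=1

while read -r line; do
    line=${line%$'\r'}    # drop a trailing CR so the exact '$$$$' comparison works on CRLF input
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    if [[ "$line" = '$$$$' ]]; then
        end=1
        exec 3>&-
    fi
done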

CodePudding user response:

You can do this with one command line using rquery (https://github.com/fuyuncat/rquery/releases).
The idea is to use a dynamic variable that is incremented by 1 whenever '$$$$' is matched; using that variable as part of the output filename splits the content into separate files.

[ rquery]$ ./rq -v "v1:0:tolong(@v1)+switch(@raw,'\$\$\$\$',1,0)" -q "s @raw |>'/tmp/test' @v1 '.txt'" samples/short_library.sdf
[ rquery]$ cat /tmp/test0.txt
untitled.cdx
csChFnd80/09142214492D

 31 34  0  0  0  0  0  0  0  0999 V2000
    8.4660    6.2927    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.4660    4.8927    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2124    2.0951    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4249    2.7951    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
 30 31  1  0  0  0  0
 31 26  1  0  0  0  0
M  END
>  <Mol_ID> (1)
1

>  <Formula> (1)
C22H24ClFN4O3

>  <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html

$$$$
[ rquery]$ cat /tmp/test1.txt
Dimesna.cdx
csChFnd80/09142214492D

 16 13  0  0  0  0  0  0  0  0999 V2000
    2.4249    1.4000    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    3.6415    2.1024    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8540    1.4024    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4904    1.7512    0.0000 Na  0  3  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  1 14  2  0  0  0  0
M  END
>  <Mol_ID> (2)
2

>  <Formula> (2)
C4H8Na2O6S4


>  <URL> (2)
http://www.selleckchem.com/products/Dimesna.html

$$$$

CodePudding user response:

The same statement that sets the current output file name also closes the previous one. close(_)^_ here is the same as close(_)^0, i.e. it always evaluates to 1, which ensures the filename counter increments for the next file even if the close() action resulted in an error.

If the output-file naming scheme allows names that begin with digits (e.g. leading zeros), so that the name's numeric value is no longer 0, change that bit to close(_)^(_<_): the exponent (_<_) is always 0, so the expression ALWAYS results in 1 for any possible string or number, including all forms of zero, the empty string, infinities, and NaNs.

mawk2 'BEGIN { getline __<(_ = "/dev/null")
              ORS = RS = "[$][$][$][$][\r]?"(FS = RS)
               __*= gsub("[^$\n]+", __, ORS)
      } NF  {
            print > (_ = "mol" (__ += close(_)^_) ".txt") }' test_file.txt

The first part, the getline from /dev/null, neither sets $0 / NF nor modifies NR / FNR, but its existence ensures that the first time close(_) is called it doesn't error out, because _ already names a file that has actually been opened.
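
For reference (not part of the original answer), a tiny demo of why that exponent always comes out as 1, runnable with any POSIX awk; mol7.txt stands in for one of the generated filenames:

awk 'BEGIN { fname = "mol7.txt"; print (-1)^fname, 0^fname }'
# prints "1 1": a name like "mol7.txt" is numerically 0, so close(_)^_ is
# effectively close(_)^0, which is 1 whether close() returned 0 or -1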

gcat -n mol12345.txt

     1  Shivani
     2  jyoti
     3  Shivani
     4  $$$$

It was reasonably speedy: from a 5.60 MB synthetic test file it created 187,710 files in 11.652 secs.
