I have millions of short input files. PyLauncher will run on a supercomputer, launching millions of Python script instances in parallel. Each one runs a program on an input file, extracts 2 lines from that program's output, and appends those 2 lines to results.txt. The Python script looks like:
for input_file in directory:
    subprocess.run(f"script_name {input_file} | sed -n '22p; 39p' | tee -a results.txt", shell=True)
results.txt will contain 2*num_input_files lines (millions of them), like:
Ligand: ./input/ZINC00001677.pdbqt
1 -8.288 0 0
Ligand: ./input/ZINC00001567.pdbqt
1 -10.86 0 0
Ligand: ./input/ZINC00001601.pdbqt
1 -7.721 0 0
I'd like to rearrange this, drop the 1, 0, and 0 from the second line of each pair, and sort so the most negative number comes first, so it looks like:
-10.86 ZINC00001567.pdbqt
-8.288 ZINC00001677.pdbqt
-7.721 ZINC00001601.pdbqt
I found this StackOverflow question: How do I sort two lines at a time in bash, using the second line as index?
But I can't quite get the commands to work for my file. Speed of execution is very important, so Bash commands or Python could both work, depending on which is faster. Thanks in advance!
CodePudding user response:
If you have enough RAM to hold the output file contents, then you could do this:
from os.path import basename

INPUTFILE = 'verylargefile.txt'
OUTPUTFILE = 'results.txt'

result = []
with open(INPUTFILE) as data:
    while line := data.readline():
        # first line of the pair: "Ligand: ./input/ZINC00001677.pdbqt"
        filename = basename(line.split()[-1])
        # second line of the pair: "1 -8.288 0 0" -> keep only the score
        v = data.readline().split()[1]
        result.append(f'{v} {filename}\n')

with open(OUTPUTFILE, 'w') as data:
    data.writelines(sorted(result, key=lambda x: float(x.split()[0])))
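To illustrate on the three sample record pairs from the question (a minimal sketch; io.StringIO stands in for the real input file, and the sorted lines are printed instead of written to disk):

```python
import io
from os.path import basename

sample = (
    "Ligand: ./input/ZINC00001677.pdbqt\n"
    "1 -8.288 0 0\n"
    "Ligand: ./input/ZINC00001567.pdbqt\n"
    "1 -10.86 0 0\n"
    "Ligand: ./input/ZINC00001601.pdbqt\n"
    "1 -7.721 0 0\n"
)

result = []
data = io.StringIO(sample)
while line := data.readline():
    filename = basename(line.split()[-1])  # e.g. ZINC00001677.pdbqt
    v = data.readline().split()[1]         # the score field
    result.append(f'{v} {filename}\n')

result.sort(key=lambda x: float(x.split()[0]))
print(''.join(result), end='')
# -10.86 ZINC00001567.pdbqt
# -8.288 ZINC00001677.pdbqt
# -7.721 ZINC00001601.pdbqt
```

Note the sort key converts the score to float; a plain string sort would order "-10.86" after "-8.288".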
CodePudding user response:
In Python I would do something like this:
with open('input.txt', 'r') as f_inp, open('output.txt', 'w') as f_out:
    while True:
        one = f_inp.readline().strip('\n')
        if not one:  # EOF
            break
        two = f_inp.readline().strip('\n')
        f_out.write(f'{two} - {one}\n')
Then I would leave it to the sort command to do the sorting.
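A sketch of that final sort step, assuming a POSIX sort is on the PATH. In the {two} - {one} layout above each line looks like "1 -8.288 0 0 - Ligand: ./input/ZINC00001677.pdbqt", so the score is field 2 and a numeric sort on that key puts the most negative score first (the temporary file here is only for illustration):

```python
import os
import subprocess
import tempfile

# Hypothetical intermediate file in the '{two} - {one}' layout
lines = [
    "1 -8.288 0 0 - Ligand: ./input/ZINC00001677.pdbqt\n",
    "1 -10.86 0 0 - Ligand: ./input/ZINC00001567.pdbqt\n",
    "1 -7.721 0 0 - Ligand: ./input/ZINC00001601.pdbqt\n",
]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.writelines(lines)
    path = f.name

# Numeric sort on field 2 (the score); -o writes the result back in place
subprocess.run(["sort", "-k2,2n", path, "-o", path], check=True)

with open(path) as f:
    sorted_text = f.read()
print(sorted_text, end='')
os.remove(path)
```

Sorting externally keeps memory use flat, which matters when results.txt has millions of lines.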