For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
    [[ ! -f "${FILENAME}.${i}.xyz" ]] && break
    cat "${FILENAME}.${i}.xyz" >> "${COMBINED_FILE}"
    mv -f "${FILENAME}.${i}.xyz" "${XYZDIR}/${JOB_BASENAME}_${i}.xyz"
done
Step 2:
for ((j=0; j<=NUM_CONF; j++)); do
    let "n = 2 + (j * LINES_PER_CONF)"
    let "m = j + 1"
    ENERGY=$(awk -v NUM="$m" 'NR==NUM { print $2 }' "$ACTFILE")
    sed -i "${n}s/.*/${ENERGY}/" "${COMBINED_FILE}"
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Lines 3 - Number of atoms: Molecular coordinates
Line (number of atoms + 1): same as line 1
Line (number of atoms + 2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies, of the form:
Line 1: conformer1 Energy1
Line 2: conformer2 Energy2
where conformer1 is in column 1 and the energy is in column 2.
The goal is to make each conformer's energy the title line for that conformer in the combined file.
CodePudding user response:
If you know that at least one matching file exists, you should be able to do this:
cat -- "${FILENAME}".[0-9][0-9][0-9].xyz > "${COMBINED_FILE}"
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
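If a matching file isn't guaranteed to exist, one way to guard against the glob staying literal (a bash-specific sketch, using the same variables as your script) is:
shopt -s nullglob
files=( "${FILENAME}".[0-9][0-9][0-9].xyz )
(( ${#files[@]} > 0 )) && cat -- "${files[@]}" > "${COMBINED_FILE}"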
However, moving these files to renamed names in another directory does require a loop, or one of the less-than-highly portable pattern-based renaming utilities.
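A minimal sketch of such a loop, reusing the variables from your script; it pulls the three-digit index out of each name with parameter expansion instead of probing all 999 candidate names:
for f in "${FILENAME}".[0-9][0-9][0-9].xyz; do
    i=${f%.xyz}     # strip the .xyz extension
    i=${i##*.}      # keep only the three-digit index
    mv -f -- "$f" "${XYZDIR}/${JOB_BASENAME}_${i}.xyz"
done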
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- "${FILENAME}".[0-9][0-9][0-9].xyz "${XYZDIR}/${JOB_BASENAME}/"
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
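For example, a sketch under the assumptions stated in the question (each conformer occupies LINES_PER_CONF lines in the combined file, the title being the second line of each block; variable names are taken from your script):
awk -v L="$LINES_PER_CONF" '
    NR == FNR { energy[NR] = $2; next }                  # pass 1: read ACT file into an array
    FNR % L == 2 { $0 = energy[int((FNR - 2) / L) + 1] } # pass 2: rewrite each title line
    { print }
' "$ACTFILE" "$COMBINED_FILE" > "${COMBINED_FILE}.tmp" &&
mv -f -- "${COMBINED_FILE}.tmp" "$COMBINED_FILE"
This makes a single pass over each file instead of launching awk and sed once per conformer.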
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before catenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number within the current file; condition/action pairs can tell when processing has moved on to the next file.
GNU Awk also has the extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END but are executed around each processed file: you can do some calculations over the records and, in ENDFILE, print the results for that file and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1 and having an END action for the last file.
CodePudding user response:
If you really want to materialize all the file names without globbing, you can always jot them (jot is like seq, but with more integer digits in its default mode before it switches to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 | mawk '_<(_+=(NR==_)*__)' \_=17 __=91   # extracting fixed-interval samples without modulo (%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926