How to use GNU parallel in bash while reading variables from stdin?

I'm trying to adapt the following lines of code for use with GNU parallel:

for ID in $(cut -f1 markers.tsv); do
    echo "$ID"
    FAA="${ID}.faa.gz"
    zcat "$FAA" | muscle -out "${ID}.msa"
done

Preferably without creating an intermediate script.

However, the examples I'm seeing here do not show where I can use my ${ID} argument.

This could be a one-liner:

for ID in $(cut -f1 markers.tsv); do echo "$ID" && FAA="${ID}.faa.gz" && zcat "$FAA" | muscle -out "${ID}.msa"; done

I'm trying this but it appears to not be running the jobs simultaneously:

cut -f1 markers.tsv | parallel -j 16 -I @ 'echo "@" && FAA="@.faa.gz" && zcat $FAA | muscle -out @.msa'

Can someone help me adapt this using 16 jobs correctly?

Example markers.tsv

PF00709.21\t1\ta
PF00406.22\t2\tb
PF01808.18\t3\tc

CodePudding user response：

Due to a bug in GNU Parallel, an input line cannot be longer than the maximum command-line length. Passing only the first column with cut keeps each input line short:

cut -f1 markers.tsv |
  parallel -j16 'echo {} && zcat {}.faa.gz | muscle -out {}.msa'
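
To check what GNU Parallel will generate before starting 16 real jobs, --dry-run prints each command instead of executing it. The sketch below assumes GNU Parallel is installed and builds a throwaway copy of the example markers.tsv in a temporary directory; the tmpdir and preview names are only for illustration, and -k is added just to keep the printed commands in input order.

```shell
# Hypothetical sample input mirroring the markers.tsv from the question.
tmpdir=$(mktemp -d)
printf 'PF00709.21\t1\ta\nPF00406.22\t2\tb\nPF01808.18\t3\tc\n' > "$tmpdir/markers.tsv"

# --dry-run prints the commands without running zcat or muscle,
# so neither the .faa.gz files nor muscle need to exist here.
preview=$(cut -f1 "$tmpdir/markers.tsv" |
  parallel --dry-run -k -j16 'echo {} && zcat {}.faa.gz | muscle -out {}.msa')
printf '%s\n' "$preview"

rm -rf "$tmpdir"
```

Each printed line should be one complete pipeline with {} replaced by an ID from the first column. Dropping --dry-run runs the jobs for real.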

CodePudding user response：

Something like

parallel --jobs 16 -a markers.tsv -C '\t' 'echo {1} && zcat {1}.faa.gz | muscle -out {1}.msa'

should work. This reads markers.tsv as the input file (-a), splits each line into tab-separated columns (-C '\t'), and, for each line, replaces {1} in the command with the value of the first column.
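
As a small illustration of the column placeholders, the sketch below echoes all three columns of the example file. It assumes GNU Parallel is installed. The temporary file, the cols variable, and the id/count/tag labels are made up for the demo, and -k is added only to keep the output in input order.

```shell
# Hypothetical sample input mirroring the markers.tsv from the question.
tmpdir=$(mktemp -d)
printf 'PF00709.21\t1\ta\nPF00406.22\t2\tb\nPF01808.18\t3\tc\n' > "$tmpdir/markers.tsv"

# {1}, {2}, {3} refer to the first, second, and third tab-separated column.
cols=$(parallel -k -a "$tmpdir/markers.tsv" -C '\t' echo 'id={1} count={2} tag={3}')
printf '%s\n' "$cols"

rm -rf "$tmpdir"
```

Only {1} is needed for the muscle job, so the other columns can simply go unused.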

Since it sounds like the columns are really, really long, and you're running into maximum command line length restrictions, you might have more luck putting the bulk of what you want to do in a function (or script file):

# Assuming bash
dowork() {
    echo "$1"
    zcat "$1.faa.gz" | muscle -out "$1.msa"
}
export -f dowork
parallel --jobs 16 -a markers.tsv -C '\t' dowork '{1}'
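
Whichever form is used, one quick way to confirm that jobs really overlap is to time a stand-in command. The sketch below assumes GNU Parallel is installed and uses sleep 1 as a placeholder for the zcat | muscle step; the inputs a to d and the variable names are hypothetical.

```shell
# Four one-second jobs with four slots should take roughly one second,
# not four, if they run at the same time.
start=$(date +%s)
out=$(printf '%s\n' a b c d | parallel --jobs 4 'sleep 1; echo {}')
end=$(date +%s)
elapsed=$((end - start))

printf '%s\n' "$out"
echo "elapsed: ${elapsed}s"
```

If the elapsed time is close to the number of inputs, the jobs ran one at a time. Output order may differ between runs unless -k is given.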