Let me show you a snippet of my Bash script and how I try to run parallel:
parallel -a "$file" \
-k \
-j8 \
--block 100M \
--pipepart \
--bar \
--will-cite \
_fix_col_number {} | _unify_null_value {} >> "$OUTPUT_DIR/$new_filename"
So, I am basically trying to process each line in a file in parallel using Bash functions defined inside my script. However, I am not sure how to pass each line to my defined functions "_fix_col_number" and "_unify_null_value". Whatever I do, nothing gets passed to the functions.
I am exporting the functions like this in my script:
declare -x NUM_OF_COLUMNS
export -f _fix_col_number
export -f _add_tabs
export -f _unify_null_value
The mentioned functions are:
_unify_null_value()
{
    _string=$(echo "$1" | perl -0777 -pe "s/(?<=\t)\.(?=\s)//g" | \
        perl -0777 -pe "s/(?<=\t)NA(?=\s)//g" | \
        perl -0777 -pe "s/(?<=\t)No Info(?=\s)//g")
    echo "$_string"
}
_add_tabs()
{
    _tabs=""
    for (( c=1; c<=$1; c++ ))
    do
        _tabs="$_tabs\t"
    done
    echo -e "$_tabs"
}
_fix_col_number()
{
    line_cols=$(echo "$1" | awk -F"\t" '{ print NF }')
    if [[ $line_cols -gt $NUM_OF_COLUMNS ]]; then
        new_line=$(echo "$1" | cut -f1-"$NUM_OF_COLUMNS")
        echo -e "$new_line\n"
    elif [[ $line_cols -lt $NUM_OF_COLUMNS ]]; then
        missing_columns=$(( NUM_OF_COLUMNS - line_cols ))
        new_line="${1//$'\n'/}$(_add_tabs $missing_columns)"
        echo -e "$new_line\n"
    else
        echo -e "$1"
    fi
}
I tried removing {} from parallel. Not really sure what I am doing wrong.
CodePudding user response:
I see two problems in the invocation plus additional problems with the functions:
- With --pipepart there are no arguments. The blocks read from the -a file are passed over stdin to your functions. Try the following commands to confirm this:

  seq 9 > file
  parallel -a file --pipepart echo
  parallel -a file --pipepart cat

  echo prints nothing but blank lines, because it receives no arguments; cat prints the file's contents back, because each block arrives on stdin. Theoretically, you could read stdin into a variable and pass that variable to your functions ...

  parallel -a file --pipepart 'b=$(cat); someFunction "$b"'

  ... but I wouldn't recommend it, especially since your blocks are 100MB each.
- Bash interprets the pipe | in your command before parallel even sees it. To run a pipe, quote the entire command (see the sketch after this list):

  parallel ... 'b=$(cat); _fix_col_number "$b" | _unify_null_value "$b"' >> ...
- _fix_col_number seems to assume its argument is a single line, but it receives 100MB blocks instead.
- _unify_null_value does not read stdin, so _fix_col_number {} | _unify_null_value {} is equivalent to _unify_null_value {} alone.
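To make that concrete, here is a rough sketch of your invocation with the quoting fix applied and the block read from stdin (your original options kept as-is). Note that it still won't do what you want: the functions treat the whole 100MB block as one line, and _unify_null_value ignores whatever is piped into it, as explained in the last two points.

parallel -a "$file" \
    -k \
    -j8 \
    --block 100M \
    --pipepart \
    --bar \
    --will-cite \
    'b=$(cat); _fix_col_number "$b" | _unify_null_value "$b"' >> "$OUTPUT_DIR/$new_filename"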
That being said, your functions can be drastically improved. They start a lot of processes, which becomes incredibly expensive for larger files. You can make some trivial improvements, like combining perl ... | perl ... | perl ... into a single perl call. Likewise, instead of storing everything in variables, you can process stdin directly: just use f() { cmd1 | cmd2; } instead of f() { var=$(echo "$1" | cmd1); var=$(echo "$var" | cmd2); echo "$var"; }.
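As a sketch of what those two small changes could look like for _unify_null_value (a single perl call that filters stdin instead of taking "$1"):

_unify_null_value()
{
    # reads stdin, writes stdout; one perl call replaces the three chained ones
    perl -pe 's/\t\K(\.|NA|No Info)(?=\s)//g'
}

You could then write producer | _unify_null_value instead of capturing the text in a variable first.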
However, don't waste time on small things like these. A complete rewrite in sed, awk, or perl is easy and should outperform every optimization of the existing functions.
Try
n="INSERT NUMBER OF COLUMNS HERE"
tabs=$(perl -e "print \"\t\" x $n")
perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" file |
cut -f "1-$n"
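Wired into your script, that could look roughly like this (variable names taken from your question):

n="$NUM_OF_COLUMNS"
tabs=$(perl -e "print \"\t\" x $n")
perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" "$file" |
    cut -f "1-$n" >> "$OUTPUT_DIR/$new_filename"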
If you still find this too slow, leave out file, pack the command into a function, export that function, and then call parallel -a file -k --pipepart nameOfTheFunction. The option --block is not necessary, as --pipepart will evenly split the input based on the number of jobs (which can be specified with -j).
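A rough sketch of that parallel variant, with NUM_OF_COLUMNS, $file, OUTPUT_DIR and new_filename set as in your script (the name _clean_block is just a placeholder):

_clean_block()
{
    # with --pipepart the block arrives on stdin, so there is no file argument
    tabs=$(perl -e "print \"\t\" x $NUM_OF_COLUMNS")
    perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" |
        cut -f "1-$NUM_OF_COLUMNS"
}
export NUM_OF_COLUMNS
export -f _clean_block

parallel -a "$file" -k --pipepart _clean_block >> "$OUTPUT_DIR/$new_filename"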