Is there a way to speed up the shell script below? It takes a good 40 minutes to update about 150,000 files every day. Given the volume of files to create and update, that may simply be what the job costs, and I don't deny that. However, if there is a much more efficient way to write this, or a better way to structure the logic entirely, I'm open to it.
#!/bin/bash
DATA_FILE_SOURCE="<path_to_source_data>/${1}"
DATA_FILE_DEST="<path_to_dest>"
for fname in $(ls -1 "${DATA_FILE_SOURCE}")
do
    for line in $(cat "${DATA_FILE_SOURCE}"/"${fname}")
    do
        FILE_TO_WRITE_TO=$(echo "${line}" | awk -F',' '{print $1"."$2".daily.csv"}')
        CONTENT_TO_WRITE=$(echo "${line}" | cut -d, -f3-)
        if [[ ! -f "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}" ]]
        then
            echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
        else
            if ! grep -Fxq "${CONTENT_TO_WRITE}" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
            then
                sed -i "/${1}/d" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
            fi
        fi
    done
done
Answer:
There are still parts of your published script that are unclear, like the sed command, but I rewrote it with saner practices and far fewer external calls, which should speed it up considerably.
#!/usr/bin/env sh

DATA_FILE_SOURCE="<path_to_source_data>/$1"
DATA_FILE_DEST="<path_to_dest>"

for fname in "$DATA_FILE_SOURCE/"*; do
    while IFS=, read -r a b content || [ "$a" ]; do
        destfile="$DATA_FILE_DEST/$a.$b.daily.csv"
        # Only touch the file when this exact content is not there yet
        # (-s silences grep when the destination does not exist yet).
        if ! grep -Fxqs "$content" "$destfile"; then
            [ -f "$destfile" ] && sed -i "/$1/d" "$destfile"
            printf '%s\n' "$content" >>"$destfile"
        fi
    done < "$fname"
done
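To see what that read does with a record, here is a tiny standalone demo; the sample line and its values are made up, only the naming scheme comes from the question:

printf '%s\n' 'ACME,price,2024-05-01,42.0' |
while IFS=, read -r a b content; do
    echo "destination: $a.$b.daily.csv"   # ACME.price.daily.csv
    echo "content:     $content"          # 2024-05-01,42.0
done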
Answer:
Make it parallel (as much as you can).
#!/bin/bash
set -e -o pipefail

declare -ir MAX_PARALLELISM=20  # pick a limit
declare -i pid
declare -a pids

# ...

for fname in "${DATA_FILE_SOURCE}/"*; do
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi

    while IFS= read -r line; do
        FILE_TO_WRITE_TO="..."
        # ...
    done < "${fname}" &  # forking here

    pids[$!]="${fname}"
done

for pid in "${!pids[@]}"; do
    wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done
Here’s a directly runnable skeleton showing how the harness above works (with 36 items to process and 20 parallel processes at most):
#!/bin/bash
set -e -o pipefail

declare -ir MAX_PARALLELISM=20  # pick a limit
declare -i pid
declare -a pids

do_something_and_maybe_fail() {
    sleep $((RANDOM % 10))
    return $((RANDOM % 2 * 5))
}

for fname in some_name_{a..f}{0..5}.txt; do  # 36 items
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi

    do_something_and_maybe_fail &  # forking here
    pids[$!]="${fname}"
    echo "${#pids[@]} running" 1>&2
done

for pid in "${!pids[@]}"; do
    wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done
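For completeness, here is one way the per-file loop from the first answer could slot into that harness. This is only a sketch: the process_one_file helper and the SOURCE_TAG variable are names I made up, the path placeholders are the same as in the question, and wait -p needs bash 5.1 or newer. Also note that if two source files can produce lines for the same destination CSV, the parallel workers may race on that file.

#!/bin/bash
set -e -o pipefail

DATA_FILE_SOURCE="<path_to_source_data>/${1}"
DATA_FILE_DEST="<path_to_dest>"
SOURCE_TAG="${1}"                 # whatever the original sed deleted by

declare -ir MAX_PARALLELISM=20
declare -i pid
declare -a pids

process_one_file() {
    # Per-file worker: same logic as the rewritten loop in the first answer.
    local fname=$1 a b content destfile
    while IFS=, read -r a b content || [[ $a ]]; do
        destfile="${DATA_FILE_DEST}/${a}.${b}.daily.csv"
        if ! grep -Fxqs "$content" "$destfile"; then
            [[ -f $destfile ]] && sed -i "/${SOURCE_TAG}/d" "$destfile"
            printf '%s\n' "$content" >>"$destfile"
        fi
    done < "$fname"
}

for fname in "${DATA_FILE_SOURCE}/"*; do
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi
    process_one_file "$fname" &   # forking here
    pids[$!]="$fname"
done

for pid in "${!pids[@]}"; do
    wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done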
Strictly avoid external processes (such as awk, grep and cut) when processing one-liners for each line. fork()ing is extremely inefficient in comparison to:

- Running one single awk/grep/cut process on an entire input file (to preprocess all lines at once for easier processing in bash) and feeding the whole output into (e.g.) a bash loop; see the awk sketch after this list.
- Using Bash expansions instead, where feasible, e.g. "${line/,/.}" and other tricks from the EXPANSION section of the man bash page, without fork()ing any further processes.
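As a sketch of the first point, the per-line awk and cut work from the question can be done by a single awk process per source file. This only reproduces the naming scheme and the append; the duplicate check and the sed deletion are deliberately left out, and the paths are the question's placeholders:

#!/bin/bash
DATA_FILE_SOURCE="<path_to_source_data>/${1}"
DATA_FILE_DEST="<path_to_dest>"

for fname in "${DATA_FILE_SOURCE}/"*; do
    # One awk process per file: fields 1 and 2 name the destination,
    # everything from field 3 on is the content to append.
    awk -F',' -v dest="${DATA_FILE_DEST}" '
        {
            file = dest "/" $1 "." $2 ".daily.csv"
            content = $0
            sub(/^[^,]*,[^,]*,/, "", content)   # drop the first two fields
            print content >> file
            close(file)                          # do not hold thousands of files open
        }
    ' "${fname}"
done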
Off-topic side notes:
- ls -1 is unnecessary. First, ls won't write multiple columns unless the output is a terminal, so a plain ls would do. Second, bash expansions are usually a cleaner and more efficient choice. (You can use nullglob to correctly handle empty directories / "no match" cases.)
- Looping over the output from cat is a (less common) useless use of cat case. Feed the file into a loop in bash instead and read it line by line. (This also gives you more line format flexibility.) Both points are shown in the short snippet after these notes.
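A minimal sketch of both notes together, again with the question's placeholder paths; with nullglob set, the loop simply does not run when the directory is empty:

#!/bin/bash
shopt -s nullglob   # an empty directory expands to nothing instead of a literal pattern

DATA_FILE_SOURCE="<path_to_source_data>/${1}"

for fname in "${DATA_FILE_SOURCE}/"*; do    # no ls needed
    while IFS= read -r line; do             # no cat needed
        printf 'read: %s\n' "${line}"
    done < "${fname}"
done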