Speed up shell script/Performance enhancement of shell script


Is there a way to speed up the shell script below? It's taking a good 40 minutes to update about 150,000 files every day. Sure, given the volume of files to create and update, that may be acceptable; I don't deny it. However, if there is a much more efficient way to write this, or to rewrite the logic entirely, I'm open to it. I'm looking for some help, please.

    #!/bin/bash
    
    DATA_FILE_SOURCE="<path_to_source_data/${1}"
    DATA_FILE_DEST="<path_to_dest>"
    
    for fname in $(ls -1 "${DATA_FILE_SOURCE}")
    do
        for line in $(cat "${DATA_FILE_SOURCE}"/"${fname}")
        do
            FILE_TO_WRITE_TO=$(echo "${line}" | awk -F',' '{print $1"."$2".daily.csv"}')
            CONTENT_TO_WRITE=$(echo "${line}" | cut -d, -f3-)
            if [[ ! -f "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}" ]]
            then
                echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
            else
                if ! grep -Fxq "${CONTENT_TO_WRITE}" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                then
                  sed -i "/${1}/d" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
"${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                    echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                fi
            fi
        done
    done

CodePudding user response:

There are still parts of your published script that are unclear, like the sed command, but I rewrote it with saner practices and far fewer external calls, which should really speed it up.

#!/usr/bin/env sh

DATA_FILE_SOURCE="<path_to_source_data/$1"
DATA_FILE_DEST="<path_to_dest>"

for fname in "$DATA_FILE_SOURCE/"*; do
  while IFS=, read -r a b content || [ "$a" ]; do
    destfile="$DATA_FILE_DEST/$a.$b.daily.csv"
    # Append only when the exact line is not already present;
    # -s silences grep when the destination file does not exist yet.
    if ! grep -Fxqs "$content" "$destfile"; then
      # Remove stale lines matching the source file name first,
      # mirroring the sed from the original script (-i is a GNU sed extension).
      [ -f "$destfile" ] && sed -i "/$1/d" "$destfile"
      printf '%s\n' "$content" >>"$destfile"
    fi
  done < "$fname"
done

CodePudding user response:

  1. Make it parallel (as much as you can).

    #!/bin/bash
    set -e -o pipefail
    
    declare -ir MAX_PARALLELISM=20  # pick a limit
    declare -i pid
    declare -a pids
    
    # ...
    
    for fname in "${DATA_FILE_SOURCE}/"*; do
      if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2  # wait -p needs bash >= 5.1
        unset 'pids[pid]'
      fi
    
      while IFS= read -r line; do
        FILE_TO_WRITE_TO="..."
        # ...
      done < "${fname}" &  # forking here
      pids[$!]="${fname}"
    done
    
    for pid in "${!pids[@]}"; do
      wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
    done
    

    Here’s a directly runnable skeleton showing how the harness above works (with 36 items to process and 20 parallel processes at most):

    #!/bin/bash
    set -e -o pipefail
    
    declare -ir MAX_PARALLELISM=20  # pick a limit
    declare -i pid
    declare -a pids
    
    do_something_and_maybe_fail() {
      sleep $((RANDOM % 10))
      return $((RANDOM % 2 * 5))
    }
    
    for fname in some_name_{a..f}{0..5}.txt; do  # 36 items
      if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
      fi
    
      do_something_and_maybe_fail &  # forking here
      pids[$!]="${fname}"
      echo "${#pids[@]} running" 1>&2
    done
    
    for pid in "${!pids[@]}"; do
      wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
    done
    
  2. Strictly avoid external processes (such as awk, grep and cut) as per-line one-liners. fork()ing a process for every single line is extremely inefficient compared to:

    • Running one single awk / grep / cut process on an entire input file (to preprocess all lines at once for easier processing in bash) and feeding the whole output into (e.g.) a bash loop. A sketch of this follows the list.
    • Using Bash expansions instead, where feasible, e.g. "${line/,/.}" and other tricks from the EXPANSION section of the man bash page, without fork()ing any further processes. A pure-bash sketch also follows the list.
  3. Off-topic side notes:

    • ls -1 is unnecessary. First, ls won’t write multiple columns unless the output is a terminal, so a plain ls would do. Second, bash expansions are usually a cleaner and more efficient choice. (You can use nullglob to correctly handle empty directories / “no match” cases.)

    • Looping over the output from cat is a (less common) useless use of cat case. Feed the file into a loop in bash instead and read it line by line. (This also gives you more line format flexibility.)
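To illustrate the first bullet of point 2, here is a minimal sketch of the one-awk-per-file idea. It reuses the placeholder paths from the question, assumes the data contains no tab characters, and leaves out the dedup/sed handling for brevity:

    #!/bin/bash

    DATA_FILE_SOURCE="<path_to_source_data>/${1}"
    DATA_FILE_DEST="<path_to_dest>"

    for fname in "${DATA_FILE_SOURCE}/"*; do
      # One single awk process per input file splits every line into a
      # destination file name and the remaining content; the bash loop
      # then only appends.
      while IFS=$'\t' read -r destfile content; do
        printf '%s\n' "${content}" >>"${DATA_FILE_DEST}/${destfile}"
      done < <(awk -F',' '{
        line = $0
        sub(/^[^,]*,[^,]*,/, "", line)          # strip fields 1 and 2
        print $1 "." $2 ".daily.csv" "\t" line  # destfile <TAB> content
      }' "${fname}")
    done

Going further, awk can append to the destination files itself (print line >> dest), which removes the bash loop entirely; with many distinct destination files you would then have to close() them to stay under the open-file limit.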
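And for the second bullet plus the side notes in point 3, a pure-bash sketch of the per-line parsing, again with the question's placeholder paths and a made-up field layout:

    #!/bin/bash
    shopt -s nullglob  # a "no match" glob expands to nothing, not a literal *

    DATA_FILE_SOURCE="<path_to_source_data>/${1}"

    for fname in "${DATA_FILE_SOURCE}/"*; do
      while IFS= read -r line; do
        a=${line%%,*}; rest=${line#*,}            # field 1, remainder
        b=${rest%%,*}                             # field 2
        FILE_TO_WRITE_TO="${a}.${b}.daily.csv"    # was: echo | awk -F','
        CONTENT_TO_WRITE=${rest#*,}               # was: echo | cut -d, -f3-
        # ... existence / dedup handling as before ...
      done < "${fname}"
    done

No process is forked for the parsing, so the per-line cost stays inside bash.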
