Is there a way to speed up the shell script below? It takes a good 40 minutes to update about 150,000 files every day. Given the volume of files to create and update, that may simply be what the job costs, and I don't deny that. However, if there is a much more efficient way to write this, or a better way to structure the logic entirely, I'm open to it.
#!/bin/bash
DATA_FILE_SOURCE="<path_to_source_data>/${1}"
DATA_FILE_DEST="<path_to_dest>"
for fname in $(ls -1 "${DATA_FILE_SOURCE}")
do
    for line in $(cat "${DATA_FILE_SOURCE}"/"${fname}")
    do
        FILE_TO_WRITE_TO=$(echo "${line}" | awk -F',' '{print $1"."$2".daily.csv"}')
        CONTENT_TO_WRITE=$(echo "${line}" | cut -d, -f3-)
        if [[ ! -f "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}" ]]
        then
            echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
        else
            if ! grep -Fxq "${CONTENT_TO_WRITE}" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
            then
                sed -i "/${1}/d" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
            fi
        fi
    done
done
Answer:
There are still parts of your published script that are unclear, like the sed command, but I rewrote it with saner practices and far fewer external calls, which should speed it up considerably.
#!/usr/bin/env sh

DATA_FILE_SOURCE="<path_to_source_data>/$1"
DATA_FILE_DEST="<path_to_dest>"

for fname in "$DATA_FILE_SOURCE/"*; do
    while IFS=, read -r a b content || [ "$a" ]; do
        destfile="$DATA_FILE_DEST/$a.$b.daily.csv"
        # Only touch the file when this exact content is not there yet
        # (-s silences grep when the destination does not exist yet).
        if ! grep -Fxqs "$content" "$destfile"; then
            [ -f "$destfile" ] && sed -i "/$1/d" "$destfile"
            printf '%s\n' "$content" >>"$destfile"
        fi
    done < "$fname"
done
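To see what that read does with a record, here is a tiny standalone demo; the sample line and its values are made up, only the naming scheme comes from the question:

printf '%s\n' 'ACME,price,2024-05-01,42.0' |
while IFS=, read -r a b content; do
    echo "destination: $a.$b.daily.csv"   # ACME.price.daily.csv
    echo "content:     $content"          # 2024-05-01,42.0
done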
Answer:
Make it parallel (as much as you can).
#!/bin/bash
set -e -o pipefail

declare -ir MAX_PARALLELISM=20  # pick a limit
declare -i pid
declare -a pids

# ...

for fname in "${DATA_FILE_SOURCE}/"*; do
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi

    while IFS= read -r line; do
        FILE_TO_WRITE_TO="..."
        # ...
    done < "${fname}" &  # forking here

    pids[$!]="${fname}"
done

for pid in "${!pids[@]}"; do
    wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done
Here’s a directly runnable skeleton showing how the harness above works (with 36 items to process and 20 parallel processes at most):
#!/bin/bash
set -e -o pipefail

declare -ir MAX_PARALLELISM=20  # pick a limit
declare -i pid
declare -a pids

do_something_and_maybe_fail() {
    sleep $((RANDOM % 10))
    return $((RANDOM % 2 * 5))
}

for fname in some_name_{a..f}{0..5}.txt; do  # 36 items
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi

    do_something_and_maybe_fail &  # forking here
    pids[$!]="${fname}"
    echo "${#pids[@]} running" 1>&2
done

for pid in "${!pids[@]}"; do
    wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done
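For completeness, here is one way the per-file loop from the first answer could slot into that harness. This is only a sketch: the process_one_file helper and the SOURCE_TAG variable are names I made up, the path placeholders are the same as in the question, and wait -p needs bash 5.1 or newer. Also note that if two source files can produce lines for the same destination CSV, the parallel workers may race on that file.

#!/bin/bash
set -e -o pipefail

DATA_FILE_SOURCE="<path_to_source_data>/${1}"
DATA_FILE_DEST="<path_to_dest>"
SOURCE_TAG="${1}"                 # whatever the original sed deleted by

declare -ir MAX_PARALLELISM=20
declare -i pid
declare -a pids

process_one_file() {
    # Per-file worker: same logic as the rewritten loop in the first answer.
    local fname=$1 a b content destfile
    while IFS=, read -r a b content || [[ $a ]]; do
        destfile="${DATA_FILE_DEST}/${a}.${b}.daily.csv"
        if ! grep -Fxqs "$content" "$destfile"; then
            [[ -f $destfile ]] && sed -i "/${SOURCE_TAG}/d" "$destfile"
            printf '%s\n' "$content" >>"$destfile"
        fi
    done < "$fname"
}

for fname in "${DATA_FILE_SOURCE}/"*; do
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi
    process_one_file "$fname" &   # forking here
    pids[$!]="$fname"
done

for pid in "${!pids[@]}"; do
    wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done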
Strictly avoid external processes (such as awk, grep and cut) when processing one-liners for each line. fork()ing is extremely inefficient in comparison to:

- Running one single awk/grep/cut process on an entire input file (to preprocess all lines at once for easier processing in bash) and feeding the whole output into (e.g.) a bash loop; see the awk sketch after this list.
- Using Bash expansions instead, where feasible, e.g. "${line/,/.}" and other tricks from the EXPANSION section of the man bash page, without fork()ing any further processes.
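As a sketch of the first point, the per-line awk and cut work from the question can be done by a single awk process per source file. This only reproduces the naming scheme and the append; the duplicate check and the sed deletion are deliberately left out, and the paths are the question's placeholders:

#!/bin/bash
DATA_FILE_SOURCE="<path_to_source_data>/${1}"
DATA_FILE_DEST="<path_to_dest>"

for fname in "${DATA_FILE_SOURCE}/"*; do
    # One awk process per file: fields 1 and 2 name the destination,
    # everything from field 3 on is the content to append.
    awk -F',' -v dest="${DATA_FILE_DEST}" '
        {
            file = dest "/" $1 "." $2 ".daily.csv"
            content = $0
            sub(/^[^,]*,[^,]*,/, "", content)   # drop the first two fields
            print content >> file
            close(file)                          # do not hold thousands of files open
        }
    ' "${fname}"
done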
Off-topic side notes:
- ls -1 is unnecessary. First, ls won't write multiple columns unless the output is a terminal, so a plain ls would do. Second, bash expansions are usually a cleaner and more efficient choice. (You can use nullglob to correctly handle empty directories / "no match" cases.)
- Looping over the output from cat is a (less common) useless use of cat case. Feed the file into a loop in bash instead and read it line by line. (This also gives you more line format flexibility.) Both points are shown in the short snippet after these notes.
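A minimal sketch of both notes together, again with the question's placeholder paths; with nullglob set, the loop simply does not run when the directory is empty:

#!/bin/bash
shopt -s nullglob   # an empty directory expands to nothing instead of a literal pattern

DATA_FILE_SOURCE="<path_to_source_data>/${1}"

for fname in "${DATA_FILE_SOURCE}/"*; do    # no ls needed
    while IFS= read -r line; do             # no cat needed
        printf 'read: %s\n' "${line}"
    done < "${fname}"
done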