Every day I have to pull around 100 repos, which I have automated with a script. This takes quite a long time when done sequentially, so I want to do this in parallel by sending each "git pull" to the background. This is my script:
#!/bin/bash
set -e -u
START=$(date +%s)
export $(grep -v '^#' .env | xargs)
allRepos=$(curl \
    -s \
    -f \
    --user "${githubUser:?}:${githubApiKey:?}" \
    --header 'Accept: application/vnd.github.v3+json' \
    --header 'Content-Type: application/json' \
    --request GET https://internalgithuburl/api/v3/user/repos?per_page=1000 | \
    jq -r '.[]' | \
    grep ssh_url | \
    sed 's/ "ssh_url": "//' | \
    sed 's/",//'
)
amountAllRepos=$(echo "$allRepos" | wc -w | xargs)
echo "$amountAllRepos Repositories found."
cd "/home/mles/projectdir"
repoCounter=1
for repo in $allRepos; do
    echo "Checking $repo ($repoCounter/$amountAllRepos)"
    repoDirectory=$(basename "$repo" ".git")
    if [ -d "$repoDirectory" ]; then
        echo " Already cloned. Updating"
        cd "$repoDirectory"
        git reset --hard --quiet
        git pull 2>&1 | sed 's/^/ /' &
        cd ..
    else
        echo " New Repo found. Cloning"
        git clone --quiet "$repo"
    fi
    repoCounter=$((repoCounter + 1))
done
cd "/home/mles"
END=$(date +%s)
DIFF=$(( END - START ))
echo "It took ${DIFF} seconds"
Notice the & at the end of git pull. When I run this, it still looks as if everything is done sequentially:
Checking git@githubdomain:group/proxytest.git (1/97)
Already cloned. Updating
Checking git@githubdomain:group/buildserver-compose.git (2/97)
Already cloned. Updating
Already up to date.
Checking git@githubdomain:group/concourse-gate-resource.git (3/97)
Already cloned. Updating
Already up to date.
Checking git@githubdomain:group/concourse-teams.git (4/97)
Already cloned. Updating
Checking git@githubdomain:group/container-base-image.git (5/97)
Already cloned. Updating
Already up to date.
Already up to date.
Checking git@githubdomain:group/container-build-images.git (6/97)
Already cloned. Updating
Already up to date.
Checking git@githubdomain:group/container-build-npm.git (7/97)
Already cloned. Updating
Already up to date.
Checking git@githubdomain:group/container-elasticsearch.git (8/97)
Already cloned. Updating
Already up to date.
Checking git@githubdomain:group/container-grafana.git (9/97)
Already cloned. Updating
Already up to date.
Checking git@githubdomain:group/container-haproxy.git (10/97)
Already cloned. Updating
Already up to date.
It also takes about the same time as the sequential run: the script runs for 1896 seconds sequentially, and for 1804 seconds with this modification.
How can I really run all the "git pull" commands in parallel?
CodePudding user response:
Did you also try to run git reset --hard --quiet and git clone in the background? Maybe try:
for repo in $allRepos; do
    printf 'Checking %s (%d/%d)\n' "$repo" "$repoCounter" "$amountAllRepos"
    repoDirectory="${repo%.git}"
    repoDirectory="${repoDirectory##*/}"
    if [ -d "$repoDirectory" ]; then
        printf ' Already cloned. Updating\n'
        { git -C "$repoDirectory" reset --hard --quiet &&
          git -C "$repoDirectory" pull 2>&1 |
          sed 's/^/ /'; } &
    else
        printf ' New Repo found. Cloning\n'
        git clone --quiet "$repo" &
    fi
    (( repoCounter += 1 ))
done
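One thing worth checking with any loop that sends its jobs to the background: the shell reaches the lines after done right away, so a timing taken there only means something if the script first waits for the jobs. A minimal sketch, assuming a hypothetical fakePull function that stands in for git pull (it is not part of the original script):

```shell
# Hypothetical stand-in for "git pull": takes one second, then reports.
fakePull() {
    sleep 1
    printf 'done %s\n' "$1"
}

start=$(date +%s)
for repo in a b c d e; do
    fakePull "$repo" &      # each job is sent to the background
done
wait                        # block until every background job has exited
end=$(date +%s)
elapsed=$(( end - start ))
echo "It took ${elapsed} seconds"
```

With no arguments, the wait built-in blocks until all background jobs of the current shell have finished, so placing it right after done should make the reported time cover the pulls themselves.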
Note the use of bash built-in parameter expansion instead of basename, and of git -C instead of cd; this should speed things up a bit.
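To make the parameter expansion concrete, here is a small sketch that compares it with basename on one of the repository URLs from the question's output. The variable names viaBasename and dir are only illustrative:

```shell
repo='git@githubdomain:group/proxytest.git'

# External command: starts a new process for every repository.
viaBasename=$(basename "$repo" ".git")

# Parameter expansion: handled by the shell itself, no extra process.
dir="${repo%.git}"     # strip the trailing ".git"
dir="${dir##*/}"       # strip everything up to and including the last "/"

printf '%s %s\n' "$viaBasename" "$dir"   # prints "proxytest proxytest"
```

Both forms give the same directory name here; the gain from avoiding the extra process is likely small compared with the time spent in git itself.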
But all this means that you will launch many jobs in parallel, probably more than the number of cores on your computer, so there is a risk that it becomes less responsive. A better approach would probably be to use a parallel-capable tool like xargs, GNU make or GNU parallel. It would allow you to control the maximum number of jobs that run at the same time. Example with xargs:
# first define and export a function to process one repository
# $1: total, $2: repository, $3: count
updateRepo() {
    printf 'Checking %s (%d/%d)\n' "$2" "$3" "$1"
    dir="${2%.git}"; dir="${dir##*/}"
    if [ -d "$dir" ]; then
        printf ' Already cloned. Updating\n'
        git -C "$dir" reset --hard --quiet
        git -C "$dir" pull 2>&1 | sed 's/^/ /'
    else
        printf ' New Repo found. Cloning\n'
        git clone --quiet "$2"
    fi
}
export -f updateRepo
# build a bash indexed array (repos) in which 2 consecutive cells
# contain the repository and the count
declare -a repos=() tmp=( $allRepos )
declare -i n=1 t="${#tmp[@]}"
for repo in "${tmp[@]}"; do
    repos+=( "$repo" "$n" )
    (( n += 1 ))
done
# pass the array to xargs, 2 entries at a time (-n2), with,
# e.g., up to 8 jobs in parallel (-P8), and let it use our
# function to process each repository
printf '%s\n' "${repos[@]}" | xargs -P8 -n2 bash -c 'updateRepo "$@"' "bash" "$t"
For each pair of array cells (repository, count) xargs will launch bash with script updateRepo "$@" and parameters $0="bash", $1="$t", $2=repository, $3=count ("$t" is the total number of repositories computed above). It will first launch 8 of them and, as soon as one finishes, it will launch a new one, always keeping 8 jobs running, until the end.
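The parameter mapping can be checked in isolation. The sketch below uses sh instead of bash, made-up repository names (repoA, repoB) and a made-up total (97), and leaves out -P so that the order of the output is deterministic:

```shell
# Feed two (repository, count) pairs to xargs, 2 entries per invocation.
# "sh" becomes $0 and 97 becomes $1; xargs appends the pair as $2 and $3.
mapping=$(printf '%s\n' repoA 1 repoB 2 |
    xargs -n 2 sh -c 'printf "%s|%s|%s|%s\n" "$0" "$1" "$2" "$3"' sh 97)

printf '%s\n' "$mapping"
# prints:
# sh|97|repoA|1
# sh|97|repoB|2
```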
Of course, whatever method you use, the output will be mixed up. If this is a problem you can modify the updateRepo function to print only the first message on the standard output and send everything else to a separate log file (e.g., $3.log).
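A rough sketch of that idea, under a few assumptions: the job body is written inline rather than as an exported function, the log directory comes from mktemp -d, the repository names are made up, and a printf stands in for the git commands. Only the redirection pattern is the point:

```shell
logDir=$(mktemp -d)   # assumed location for the per-job log files

printf '%s\n' alpha 1 beta 2 gamma 3 |
    xargs -P 2 -n 2 sh -c '
        # $1: log directory, $2: repository, $3: count
        printf "Checking %s (%s)\n" "$2" "$3"    # only this line goes to stdout
        {
            # the git commands would go here; this printf is a stand-in
            printf "output of the job for %s\n" "$2"
        } > "$1/$3.log" 2>&1
    ' sh "$logDir"

logCount=$(ls "$logDir" | wc -l | xargs)
echo "$logCount log files written to $logDir"
```

The "Checking" lines can still appear in any order, but each job's remaining output ends up in its own file, named after the count, and can be read afterwards without interleaving.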