Send "git pull" to background in a for loop in a bash script

Every day I have to pull around 100 repos, which I have automated with a script. This takes quite a long time when done sequentially, so I want to do it in parallel by sending each "git pull" to the background. This is my script:

#!/bin/bash

set -e -u

START=$(date +%s)

export $(grep -v '^#' .env | xargs)

allRepos=$(curl  \
  -s \
  -f \
  --user "${githubUser:?}:${githubApiKey:?}" \
  --header 'Accept: application/vnd.github.v3+json' \
  --header 'Content-Type: application/json' \
  --request GET https://internalgithuburl/api/v3/user/repos?per_page=1000 | \
  jq -r '.[]' | \
  grep ssh_url | \
  sed 's/  "ssh_url": "//' | \
  sed 's/",//'
  )
amountAllRepos=$(echo "$allRepos" | wc -w | xargs)

echo "$amountAllRepos Repositories found."

cd "/home/mles/projectdir"

repoCounter=1

for repo in $allRepos; do
  echo "Checking $repo ($repoCounter/$amountAllRepos)"

  repoDirectory=$(basename "$repo" ".git")

  if [ -d "$repoDirectory" ]; then
    echo "  Already cloned. Updating"
    cd "$repoDirectory"
    git reset --hard --quiet
    git pull 2>&1 | sed 's/^/  /' &
    cd ..
  else
    echo "  New Repo found. Cloning"
    git clone --quiet "$repo"
  fi

  repoCounter=$((repoCounter + 1))

done

cd "/home/mles"

END=$(date +%s)
DIFF=$(( $END - $START ))
echo "It took ${DIFF} seconds"

Notice the & at the end of the git pull line. When I run this, it still looks as if the pulls are done sequentially:

Checking git@githubdomain:group/proxytest.git (1/97)
  Already cloned. Updating
Checking git@githubdomain:group/buildserver-compose.git (2/97)
  Already cloned. Updating
  Already up to date.
Checking git@githubdomain:group/concourse-gate-resource.git (3/97)
  Already cloned. Updating
  Already up to date.
Checking git@githubdomain:group/concourse-teams.git (4/97)
  Already cloned. Updating
Checking git@githubdomain:group/container-base-image.git (5/97)
  Already cloned. Updating
  Already up to date.
  Already up to date.
Checking git@githubdomain:group/container-build-images.git (6/97)
  Already cloned. Updating
  Already up to date.
Checking git@githubdomain:group/container-build-npm.git (7/97)
  Already cloned. Updating
  Already up to date.
Checking git@githubdomain:group/container-elasticsearch.git (8/97)
  Already cloned. Updating
  Already up to date.
Checking git@githubdomain:group/container-grafana.git (9/97)
  Already cloned. Updating
  Already up to date.
Checking git@githubdomain:group/container-haproxy.git (10/97)
  Already cloned. Updating
  Already up to date.

It also takes about the same time as the sequential version: the script runs for 1896 seconds sequentially, and for 1804 seconds with this modification.

How can I really run all the "git pull" commands in parallel?

CodePudding user response:

Did you try to also run git reset --hard --quiet and git clone in the background? Try maybe:

for repo in $allRepos; do

  printf 'Checking %s (%d/%d)\n' "$repo" "$repoCounter" "$amountAllRepos"

  repoDirectory="${repo%.git}"
  repoDirectory="${repoDirectory##*/}"

  if [ -d "$repoDirectory" ]; then
    printf '  Already cloned. Updating\n'
    { git -C "$repoDirectory" reset --hard --quiet &&
      git -C "$repoDirectory" pull 2>&1 |
      sed 's/^/  /'; } &
  else
    printf '  New Repo found. Cloning\n'
    git clone --quiet "$repo" &
  fi

  (( repoCounter += 1 ))

done

Note the use of bash parameter expansion instead of basename, and of git -C instead of cd; this should probably speed things up a bit.
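The loop pattern above can be tried without any real repository. The following is only a sketch under stated assumptions: the two URLs are made up in the same form as the question's output, fakePull is a hypothetical stand-in for git -C "$dir" pull, and workDir is a scratch directory used only for this demonstration. The wait at the end is an addition that is not in the loop above: without it, the script would carry on (and compute its elapsed time) while the background jobs are still running.

```shell
#!/bin/bash
# Hypothetical repository URLs, in the same form as the question's output.
allRepos='git@githubdomain:group/proxytest.git git@githubdomain:group/concourse-teams.git'

workDir=$(mktemp -d)   # scratch directory for this demonstration only

# Hypothetical stand-in for: git -C "$1" pull
# It sleeps for a second, then records that it completed.
fakePull() {
  sleep 1
  printf 'Already up to date.\n' > "$workDir/$1.done"
}

for repo in $allRepos; do
  repoDirectory="${repo%.git}"            # drop the trailing ".git"
  repoDirectory="${repoDirectory##*/}"    # drop everything up to the last "/"
  fakePull "$repoDirectory" &             # runs in the background
done

wait   # assumption: added so the script does not move on before the jobs finish

finished=( "$workDir"/*.done )
printf '%d jobs finished\n' "${#finished[@]}"
```

Both stand-in jobs sleep at the same time, so the whole run should take roughly one second rather than two.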

But all this means that you will launch many jobs in parallel, probably more than the number of cores on your computer, and there is a risk that the machine becomes less responsive. A better approach would probably be to use a parallel-capable tool like xargs, GNU make or GNU parallel, which lets you control the maximum number of jobs that run at the same time. Example with xargs:

# first define and export a function to process one repository
# $1: total, $2: repository, $3: count

updateRepo() {
  printf 'Checking %s (%d/%d)\n' "$2" "$3" "$1"
  dir="${2%.git}"; dir="${dir##*/}"
  if [ -d "$dir" ]; then
    printf '  Already cloned. Updating\n'
    git -C "$dir" reset --hard --quiet
    git -C "$dir" pull 2>&1 | sed 's/^/  /' 
  else
    printf '  New Repo found. Cloning\n'
    git clone --quiet "$2"
  fi
}
export -f updateRepo

# build a bash indexed array (repos) in which 2 consecutive cells
# contain the repository and the count

declare -a repos=() tmp=( $allRepos )
declare -i n=1 t="${#tmp[@]}"

for repo in "${tmp[@]}"; do
  repos+=( "$repo" "$n" )
  (( n += 1 ))
done

# pass the array to xargs, 2 entries at a time (-n2), with,
# e.g., up to 8 jobs in parallel (-P8), and let it use our
# function to process each repository

printf '%s\n' "${repos[@]}" | xargs -P8 -n2 bash -c 'updateRepo "$@"' "bash" "$t"

For each pair of array cells (repository, count), xargs launches bash with the script updateRepo "$@" and the parameters $0="bash", $1="$t", $2=repository, $3=count ("$t" is the total number of repositories computed above). It first launches 8 of them and, as soon as one finishes, launches a new one, keeping 8 jobs running until the list is exhausted.
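To see this argument mapping in isolation, here is a small sketch that replaces updateRepo with a hypothetical showArgs function, which only prints what it receives. The names repoA and repoB and the total of 2 are made up for illustration, and -P1 is used instead of -P8 only so that the output order is predictable.

```shell
#!/bin/bash
# Hypothetical stand-in for updateRepo: prints its arguments instead of calling git.
# $1: total, $2: repository, $3: count
showArgs() {
  printf 'total=%s repo=%s count=%s\n' "$1" "$2" "$3"
}
export -f showArgs   # makes the function visible to the bash started by xargs

# Two (repository, count) pairs; the trailing "2" plays the role of "$t".
# xargs appends each pair after it, so $1=2, $2=repository, $3=count.
result=$(printf '%s\n' repoA 1 repoB 2 |
  xargs -P1 -n2 bash -c 'showArgs "$@"' bash 2)

printf '%s\n' "$result"
```

This prints total=2 repo=repoA count=1 followed by total=2 repo=repoB count=2. The word bash after the quoted script only fills $0, which is why the total ends up in $1.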

Of course, whatever method you use, the output of the parallel jobs will be interleaved. If this is a problem, you can modify the updateRepo function to print only the first message on the standard output and send everything else to a separate log file (e.g., $3.log).
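One possible shape for that log-file variant is sketched below. It is not a definitive implementation: logDir is a hypothetical location for the logs, and fakeUpdate is a hypothetical stand-in for the git reset, git pull and git clone commands so that the sketch runs without any repository. In a real script those git commands would take its place.

```shell
#!/bin/bash
logDir=$(mktemp -d)   # hypothetical location for the per-repository logs
export logDir         # the bash processes started by xargs need to see it

# Hypothetical stand-in for the git commands of updateRepo.
fakeUpdate() {
  printf 'Already up to date.\n'
}

# $1: total, $2: repository, $3: count (same convention as above)
updateRepo() {
  printf 'Checking %s (%d/%d)\n' "$2" "$3" "$1"   # only this line goes to stdout
  fakeUpdate "$2" > "$logDir/$3.log" 2>&1          # everything else goes to $3.log
}
export -f fakeUpdate updateRepo

# Three made-up (repository, count) pairs, two jobs at a time.
printf '%s\n' repoA 1 repoB 2 repoC 3 |
  xargs -P2 -n2 bash -c 'updateRepo "$@"' bash 3

logs=( "$logDir"/*.log )
printf '%d log files written to %s\n' "${#logs[@]}" "$logDir"
```

The "Checking" lines may still appear in any order, but each repository's details stay together in its own file.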