Git: Get all changed files since the last push

Time:07-26

I am writing a GitLab CI pipeline and I am trying to find all files that have changed since the last push to the remote Git repository.

I know how to get all files that were changed in the last commit, but if more than one commit was pushed at the same time, I can still only find the changes of the last one.

CodePudding user response:

The best way to ensure you're covering all commits in a push would be to write a pre-receive hook, which will have access to all commits in every push. This is especially important if you must consider that it's possible for history rewrites to happen.
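
For illustration, here is a minimal sketch of what such a hook could do. Git passes a pre-receive hook one line per updated ref on standard input, in the form `<old-sha> <new-sha> <ref-name>`. The function names `is_zero_sha` and `list_changed_files` are hypothetical, and how newly created or deleted refs should be handled is an assumption made here, not something the approach dictates.

```shell
# Sketch of a pre-receive hook body: prints the files changed by each ref update.
# Git feeds the hook one line per updated ref: "<old-sha> <new-sha> <ref-name>".

# Succeeds if the argument is an all-zero object ID (ref created or deleted).
is_zero_sha() {
    case "$1" in
        ""|*[!0]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: reads ref updates from stdin and lists changed files.
list_changed_files() {
    local old new ref
    while read -r old new ref; do
        if is_zero_sha "$new"; then
            continue  # ref deleted: nothing to diff
        elif is_zero_sha "$old"; then
            # New ref: there is no previous tip, so (as an assumption) list files
            # touched by commits not reachable from any existing ref.
            git log --name-only --pretty=format: "$new" --not --all
        else
            # Existing ref: compare the old tip with the new tip.
            git diff --name-only "$old" "$new"
        fi
    done
}
```

In an actual hook file, the script would end by calling `list_changed_files` (which reads the ref updates from standard input) and exiting with status 0 to accept the push.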

But if you must use a pipeline job to do this, one way would be to cache (or otherwise store/retrieve) the last seen commit ref and use that as your reference.

This approach will not be as robust as a pre-receive hook (though it could be made robust) because, among other issues:

  1. Pipelines can be rerun on old commit hashes
  2. The clone depth may not be large enough to retrieve the necessary commits (you can fix this with GIT_DEPTH, but it's a consideration)
  3. History can be rewritten
  4. Commits with older timestamps can be pushed after commits with newer timestamps (timestamps are also somewhat arbitrary, since they can be set by the committer)
  5. Different branches may have different/diverged histories
  6. Pipelines can be skipped in a variety of circumstances
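
Some of these caveats (shallow clones, rewritten history, re-runs on old commits) can be checked for explicitly before diffing. A rough sketch, assuming the last seen SHA is supplied by the caller; the function names are hypothetical, and the deepen step of 50 commits and the limit of 5 attempts are arbitrary:

```shell
# Succeeds if the commit object exists in the local (possibly shallow) clone.
have_commit() {
    git cat-file -e "$1^{commit}" 2>/dev/null
}

# Tries to make a missing commit available by deepening a shallow clone.
# Step size (50) and attempt limit (5) are arbitrary illustrative values.
ensure_commit() {
    local sha attempts
    sha="$1"
    attempts=0
    while ! have_commit "$sha" && [ "$attempts" -lt 5 ]; do
        git fetch --quiet --deepen=50 || return 1
        attempts=$((attempts + 1))
    done
    have_commit "$sha"
}

# Succeeds only if $1 is an ancestor of HEAD, i.e. HEAD was reached by adding
# commits on top of $1. Fails after a history rewrite, or when a pipeline is
# re-run on a commit older than $1.
is_fast_forward_from() {
    git merge-base --is-ancestor "$1" HEAD
}
```

A job could call `ensure_commit "$last_sha" && is_fast_forward_from "$last_sha"` and fall back to treating every file as changed when either check fails.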

But an implementation of this general idea may look something like this:

my_job:
  cache:
    key: last-push  # or consider keying on `CI_COMMIT_BRANCH` or similar
    paths:
      - "last-push.txt"
  rules:
    - if: "$CI_COMMIT_BRANCH"
  script:
    - |
      if [[ -f "last-push.txt" ]]; then
          source last-push.txt
      else
          echo "LAST_CI_COMMIT_SHA=${CI_COMMIT_SHA}" > last-push.txt
          echo "LAST_CI_COMMIT_TIMESTAMP=${CI_COMMIT_TIMESTAMP}" >> last-push.txt
          exit 0  # there is no cache, so this is the first pipeline to populate the cache
          # nothing to do. Alternatively, consider entire history/all files
      fi
      # convert both timestamps to seconds since the epoch (GNU date)
      last_date=$(date -d "$LAST_CI_COMMIT_TIMESTAMP" +%s)
      this_date=$(date -d "$CI_COMMIT_TIMESTAMP" +%s)
      if (( this_date <= last_date )); then
          exit 0  # current HEAD is older than last known HEAD. Someone may have re-run a pipeline on an older commit; exit to avoid giving the cache a bad value... there's probably a better way to handle this
      fi
      # show all commit SHAs since last push
      # hope the clone depth was large enough to get this!
      git log --since="$LAST_CI_COMMIT_TIMESTAMP" --pretty=%H
      # get files that have changed since then
      # hope the clone depth was large enough to get this!
      git diff --name-only HEAD "${LAST_CI_COMMIT_SHA}"
      # finally, store the current HEAD into the cache:
      echo "LAST_CI_COMMIT_SHA=${CI_COMMIT_SHA}" > last-push.txt
      echo "LAST_CI_COMMIT_TIMESTAMP=${CI_COMMIT_TIMESTAMP}" >> last-push.txt

This is untested, so there may be minor bugs, but the general idea is there.
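
Because commit timestamps can be arbitrary, the list of commits can also be derived from the cached SHA with a revision range instead of a date filter. A minimal sketch; the function names are hypothetical, and the argument stands for the `LAST_CI_COMMIT_SHA` value read from `last-push.txt`:

```shell
# Lists the SHAs of commits reachable from HEAD but not from the given commit.
commits_since() {
    git log --pretty=%H "$1..HEAD"
}

# Lists the files that differ between the given commit and HEAD.
files_changed_since() {
    git diff --name-only "$1" HEAD
}
```

For example, `files_changed_since "$LAST_CI_COMMIT_SHA"` would print one path per line. Like the job above, this still depends on the clone being deep enough to contain the cached commit.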

To work around the fact that Git does not itself track push events, an alternative may be to use the GitLab Project Events API to find the push preceding the one that triggered the pipeline, but you would potentially have to sift through a lot of data, including pushes to other branches.
