How to get a subproject from commit list

Time:12-01

I'm trying to get all the commits from a GitLab repository, and it was all going smoothly and fine.

Because of another, unrelated problem, I had to update my Python from 3.7 to 3.9. Since then, every time I run my program I get this specific error:

Exception has occurred: ValueError SHA b'7e944e65ee1a628e7ba0d53aac7a7bb13e79fe53' could not be resolved, git returned: b'7e944e65ee1a628e7ba0d53aac7a7bb13e79fe53 missing'

Meanwhile, I discovered that this specific commit that is causing the error, besides modifying a file, also has this:

Subproject commit 7e944e65ee1a628e7ba0d53aac7a7bb13e79fe53

Does anyone know how I can fix this problem?

My code is as follows:

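import sys

from pydriller import Repository

try:
    repo = Repository(repo_name).traverse_commits()
except Exception:
    print("Repository not found")
    sys.exit(1)  # non-zero status, since this is a failure

for commit in repo:
    for f in commit.modified_files:  # (error here)
        pass  # (get info)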

Answer:

A "subproject commit" (as printed this way by Git itself) is actually a gitlink, which is a very specific kind of item stored in one of two places:

  • in Git's index, as a path name and mode 160000 plus a hash ID; or
  • in a tree object, as a component name, mode 160000, and hash ID.

The hash ID in this case is the 7e944e65ee1a628e7ba0d53aac7a7bb13e79fe53 value, which is not itself a byte string but would often be represented as a byte string in Python.
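To see which entries in a commit are gitlinks, one option is to filter the output of `git ls-tree -r <commit>` for mode 160000. The sketch below only parses text that has already been captured from that command. The function name `find_gitlinks` and the path `vendor/lib` are made up for illustration, and the parser assumes plain paths (Git quotes unusual path names unless you pass `-z`).

```python
GITLINK_MODE = "160000"


def find_gitlinks(ls_tree_output):
    """Return {path: commit_hash} for every gitlink in `git ls-tree -r` output.

    Each output line has the form "<mode> <type> <hash>\t<path>".
    """
    gitlinks = {}
    for line in ls_tree_output.splitlines():
        if not line.strip():
            continue
        meta, _, path = line.partition("\t")
        mode, _obj_type, obj_hash = meta.split()
        if mode == GITLINK_MODE:
            gitlinks[path] = obj_hash
    return gitlinks


# Hypothetical captured output: one ordinary file and one gitlink.
SAMPLE_LS_TREE = (
    "100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\t.gitmodules\n"
    "160000 commit 7e944e65ee1a628e7ba0d53aac7a7bb13e79fe53\tvendor/lib\n"
)

if __name__ == "__main__":
    print(find_gitlinks(SAMPLE_LS_TREE))
```

Any hash returned here names a commit in the submodule's repository, not in the repository you ran `git ls-tree` in.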

The trick here is that this is the hash ID of a commit that should exist in some other Git repository. Unless the "other" Git repository is in fact this Git repository (a rare but not unheard-of situation [1]), it will not exist in this Git repository. You therefore can't look it up using this Git repository.

To get the commit, you must:

  1. Clone the other Git repository, if you have not yet done so: you'll find its URL in the .gitmodules file in the commit that contains this gitlink. If this gitlink's path is P, you must read the .gitmodules file and find each [submodule "name"] entry and under that entry find each url and path value. If the path value matches P, the URL value is the URL for the submodule.
  2. Now that the submodule is cloned, attempt to load the specified commit from the other Git repository. It may not exist: if that's the case, this is simply a bad / invalid gitlink, and cannot be used. If the specified commit does exist, that's the commit that this superproject commit says should be checked out, if this particular commit is also to be checked out.
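Step 1 can be sketched as a small lookup over the text of .gitmodules. This is a minimal hand-rolled parser, not Git's own config reader: it handles only the simple `[submodule "name"]` / `key = value` layout that `git submodule add` writes. The function names, the path `vendor/lib`, and the URL are hypothetical.

```python
import re

SECTION_RE = re.compile(r'^\[submodule\s+"(?P<name>.+)"\]\s*$')


def parse_gitmodules(text):
    """Return {submodule name: {key: value}} from the text of .gitmodules."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        match = SECTION_RE.match(line)
        if match:
            current = sections.setdefault(match.group("name"), {})
        elif line.startswith("["):
            current = None  # some other kind of section; ignore its keys
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip().lower()] = value.strip()
    return sections


def submodule_url_for_path(text, gitlink_path):
    """Return the url of the submodule whose path equals gitlink_path, or None."""
    for entry in parse_gitmodules(text).values():
        if entry.get("path") == gitlink_path:
            return entry.get("url")
    return None


# Hypothetical .gitmodules content.
SAMPLE_GITMODULES = """\
[submodule "lib"]
\tpath = vendor/lib
\turl = https://example.com/some/lib.git
"""

if __name__ == "__main__":
    print(submodule_url_for_path(SAMPLE_GITMODULES, "vendor/lib"))
```

For step 2, once the submodule is cloned, running `git cat-file -e <hash>^{commit}` inside that clone exits with status 0 only if the commit exists there.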

Note that no commit contains any modified files:

for f in commit.modified_files:

At a guess, this refers to the pydriller library, which provides such a field. The problem here is that Git doesn't have this. Instead, Git commits have snapshots and metadata. One can synthetically compute a list of "modified files" by obtaining not only this commit's set of files in its snapshot, but also some previous commit's set of files, in its snapshot, and then comparing the two.
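The comparison can be illustrated with plain dictionaries standing in for two snapshots. This is only a sketch of the idea, not how Git or Pydriller implement it: each snapshot maps a path to a (mode, hash) pair, and the short hashes and paths are invented for the example.

```python
def changed_paths(parent_snapshot, child_snapshot):
    """Return {path: (old_entry, new_entry)} for paths whose entry differs.

    An entry is a (mode, hash) tuple; None means the path is absent
    from that snapshot (added or deleted).
    """
    changes = {}
    for path in sorted(set(parent_snapshot) | set(child_snapshot)):
        old = parent_snapshot.get(path)
        new = child_snapshot.get(path)
        if old != new:
            changes[path] = (old, new)
    return changes


# Invented snapshots: one edited file, one unchanged file, one moved gitlink.
PARENT = {
    "README.md": ("100644", "aaa111"),
    "main.py": ("100644", "bbb222"),
    "vendor/lib": ("160000", "ccc333"),
}
CHILD = {
    "README.md": ("100644", "aaa111"),
    "main.py": ("100644", "ddd444"),
    "vendor/lib": ("160000", "eee555"),
}

if __name__ == "__main__":
    for path, (old, new) in changed_paths(PARENT, CHILD).items():
        print(path, old, "->", new)
```

Note that the `vendor/lib` entry shows up as "modified" even though neither hash names anything stored in this repository: both are gitlinks, recognizable by mode 160000.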

Computing this list (the "modified files" implied by comparing a commit to its single parent) is very useful, so both Pydriller and Git itself have a way to do that: in Git you can run git show or git diff-tree, for instance, while Pydriller simply Just Does It. But if you're not careful with this synthesized information, you will be led astray, just as you were here. It's important to realize that this list of modified files is an illusion, albeit a useful one. In some situations, it's less useful than others. When working with submodules, a difference of the form old gitlink was <hash1>, new gitlink is <hash2> is just that: a difference in gitlinks. It's up to you to realize that these are both gitlinks and hence both refer to some commit that should (but may or may not, at this point) exist in some other repository.
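If the goal is only to keep the traversal going, one possible workaround is to catch the ValueError around the access that fails in the question. This is a sketch under the assumption that the error is raised when `commit.modified_files` is read, as the question's comment suggests. The helper name `safe_modified_files` is made up, and skipping this way also discards any ordinary file changes in the same commit.

```python
def safe_modified_files(commit):
    """Return commit.modified_files as a list, or [] if resolving them fails.

    `commit` is expected to behave like a Pydriller commit object; only the
    `modified_files` attribute (and, for the message, `hash`) is used.
    """
    try:
        return list(commit.modified_files)
    except ValueError as exc:
        # For example: a gitlink hash that does not exist in this repository.
        commit_id = getattr(commit, "hash", "<unknown>")
        print(f"Skipping modified files for {commit_id}: {exc}")
        return []
```

In the question's loop, this would replace the inner iteration with `for f in safe_modified_files(commit):`.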


[1] If a repository refers to itself in a gitlink, this is a bit recursive and you may need to use even more care here. The only case I know of where it's common is with GitHub Pages, where people will insert a gitlink to the original repository, but carefully store different commits in different commit-chains so that the recursion terminates immediately.
