Handling of local changes when switching branch

Time:11-26

What happens with this simple workflow:

x@PC MINGW64 /c/Temp/tests/git/branches/changes
$ git init
Initialized empty Git repository in C:/Temp/tests/git/branches/changes/.git/

x@PC MINGW64 /c/Temp/tests/git/branches/changes (master)
$ echo "CHANGE #1" >> test.txt

x@PC MINGW64 /c/Temp/tests/git/branches/changes (master)
$ git add test.txt

x@PC MINGW64 /c/Temp/tests/git/branches/changes (master)
$ git commit -m "."
[master (root-commit) 439c0f8] .
 1 file changed, 1 insertion(+)
 create mode 100644 test.txt

x@PC MINGW64 /c/Temp/tests/git/branches/changes (master)
$ git branch branch-1

x@PC MINGW64 /c/Temp/tests/git/branches/changes (master)
$ echo "CHANGE #2" >> test.txt

x@PC MINGW64 /c/Temp/tests/git/branches/changes (master)
$ cat test.txt
CHANGE #1
CHANGE #2

x@PC MINGW64 /c/Temp/tests/git/branches/changes (master)
$ git switch branch-1
Switched to branch 'branch-1'
M       test.txt

x@PC MINGW64 /c/Temp/tests/git/branches/changes (branch-1)
$ git add test.txt

x@PC MINGW64 /c/Temp/tests/git/branches/changes (branch-1)
$ git commit -m "."
[branch-1 4c62bc9] .
 1 file changed, 1 insertion(+)

x@PC MINGW64 /c/Temp/tests/git/branches/changes (branch-1)
$ git switch master
Switched to branch 'master'

x@PC MINGW64 /c/Temp/tests/git/branches/changes (master)
$ cat test.txt
CHANGE #1

In words:

  • when working in master create a file with "CHANGE #1"
  • add and commit it
  • create another branch branch-1
  • make another change adding "CHANGE #2"
  • switch to branch-1
  • add and commit the file
  • switch back to master

(the order of creating the branch and making the second change does not seem to have any importance)

I was surprised by:

  • seeing local changes made "in the context of master" in branch-1
  • not seeing the changes anymore when switching back to master

So I have 2 questions:

  1. When switching to branch-1, the local changes were left untouched, so they are not associated with master; Git seems to simply ignore them. Where is this behaviour documented?
  2. After committing the changes on branch-1 and switching back to master, the second change is no longer visible from master. Roughly speaking, the change has been captured on branch-1. What is the exact terminology for this (snapshot)?

CodePudding user response:

eftshift0's answer covers the practical aspects here. There's something important that you have missed about how Git works that explains why this happens, though.

It's common for those new to Git (or who use it only sporadically) to think that when you clone a repository and check out some commit, the files that you can see, read, edit, and so on are the files that are in Git. This is wrong: the files in your working tree are not in Git. They may have just come out of Git, but now they're not in Git. I'll expand on this idea in a moment since it can be quite confusing.

The fact that these files aren't in Git explains—or at least, is necessary to comprehend the explanation—why the files are still there after you've switched to some other branch. They're simply still there and still not in Git. You need to grab hold, mentally, of the idea of what is in Git and what isn't in Git.

What is in Git

Git works with a repository: a single repository at a time [1]. A repository is, as noted in the gitglossary:

A collection of refs together with an object database containing all objects which are reachable from the refs ...

This "collection of refs" is really a second database, holding branch names, tag names, and many other kinds of names. It's just currently rather poorly implemented ("poorly" at least in a generic sense: the default files-and-packed-file system works fine on Linux for small repositories that don't have tens of thousands of refs). So a repository is, at its heart, just two databases. There are a bunch of auxiliary files and additional databases in most repositories, and (this part is important for getting any new work done) most of the repositories you'll use directly provide a working tree as well.

Peculiarly, Git puts the repository proper—the two databases and the various small files and stuff—inside the working tree, in a hidden .git folder. The stuff in the .git folder is the repository. The working tree isn't in the .git folder. The working tree is thus outside the repository.
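
As a quick check of where the two pieces live, here is a minimal sketch. It assumes a throwaway directory from mktemp; the variable names are only for illustration.

```shell
# Create an empty repository in a scratch directory (hypothetical location).
repo=$(mktemp -d)
git init -q "$repo"

# Ask Git where the repository proper and the working tree are.
git_dir=$(git -C "$repo" rev-parse --git-dir)          # relative path: .git
work_tree=$(git -C "$repo" rev-parse --show-toplevel)  # the directory around it

echo "repository proper: $work_tree/$git_dir"
echo "working tree:      $work_tree"
```

Everything under the .git directory is the repository; everything else in that folder is working tree.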

Inside the repository, one database—the one the glossary doesn't call out as a database—contains your branch and tag and other names, which help you and Git find the commits that you care about. The other database, the one "containing all objects" as it says, has the actual commits and files and so on.

From a high level viewpoint, then, the repository:

  • contains names that help find commits, and
  • contains commits

and that's mostly it! But obviously that's not really enough, so we have to look inside the commits. Each commit:

  • is numbered, so that it can be accessed by its unique number, which Git calls its object ID (OID) formally, or hash ID less formally;
  • is fully read-only: no part of any existing commit (or any object, really) can ever be changed; and
  • has two parts: metadata, which we'll ignore here, and a full snapshot of every file.

The full snapshot is stored indirectly, through yet more Git objects, each of which is numbered and read-only as with the commit objects.
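
To see this indirection for yourself, something like the following should work. It is a sketch in a scratch repository; the "demo" identity and the file name are assumptions made for the example.

```shell
# Throwaway identity so that "git commit" works on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
echo "CHANGE #1" > test.txt
git add test.txt
git commit -q -m "first"

# The commit points at a tree object, and the tree points at blob objects.
commit_type=$(git cat-file -t HEAD)           # commit
tree_type=$(git cat-file -t 'HEAD^{tree}')    # tree
blob_type=$(git cat-file -t HEAD:test.txt)    # blob

git cat-file -p HEAD     # metadata, including a "tree <hash ID>" line
git ls-tree HEAD         # one line per file: mode, type, hash ID, name

# The file content as stored in the snapshot.
blob_text=$(git cat-file -p HEAD:test.txt)
```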

So the files that are in a Git repository are found through the commits in the repository, which we find using things like branch names. But since they're objects in this object database, they're read-only—and, important for various reasons, they're specially formatted, pre-compressed and with file contents de-duplicated within and across commits. This saves enormous amounts of space in a typical repository objects database, because most commits mostly have most of the same contents as the previous commit, which mostly has the same contents as the next-earlier commit, and so on.
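
The de-duplication falls out of the numbering: identical content gets the identical hash ID, so it only needs to be stored once. A small sketch, with made-up file names, using git hash-object to compute the ID without storing anything:

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q

# Two files with the same content, one with different content.
echo "same content" > a.txt
echo "same content" > b.txt
echo "other content" > c.txt

id_a=$(git hash-object a.txt)
id_b=$(git hash-object b.txt)
id_c=$(git hash-object c.txt)

echo "a.txt $id_a"
echo "b.txt $id_b"   # same ID as a.txt: one stored object would serve both
echo "c.txt $id_c"   # different content, different ID
```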


[1] Internally, inside at least one implementation of Git (the one most often described, since it's the original C version), there's a global variable named the_repository. A Git program, at startup, generally figures out where the repository is, and populates this variable's fields. There used to be a single global the_index as well, and with the option of adding new working trees (git worktree add) this became a problem, so it's been reworked. There is ongoing work now to make submodules work better, and the submodules have the same kind of problem: each submodule is a Git repository, so having a single global "the" Git repository variable is a problem.


What's not in Git

First let's do a lightning review. Part of what is in Git:

  • The repository stores commits.
  • The commits store files: a full archive of every file, frozen for all time.

But the files inside the commits are in a special, compressed, read-only, Git-only, de-duplicated format. You literally can't read them (only Git can read them [2]), and nothing, not even Git itself, can overwrite them. So they're completely useless for getting anything done!

For this reason, before you can actually do anything, you must have Git extract the files from some commit. This is the checking-out process. Once you have a repository, you use git switch (new in 2.23) or git checkout (pre-2.23, still works fine, just has some confusing cases that finally convinced the Git folks to add git switch) to fill in an empty working tree. The working tree, as its name implies, is where you get to work with / on your files. Formally, the working tree contains ordinary OS files.

The act of selecting a commit to check out, with git checkout or git switch, essentially tells Git: I'd like you to populate the working tree from the commit I have selected. If your working tree is completely empty, as it is in a fresh new clone, this means: For every file in the commit, expand it out to a normal usable file.

Once you've done that, though, you now have two copies of each of these "active" files:

  • There's a read-only, Git-ized, compressed and de-duplicated copy inside the commit (technically, inside the object database, with the commit just finding it for you / Git).
  • There's an ordinary read/write copy of the file in your working tree.

These two match. That makes it safe to remove the working tree copy—until you change it, that is!

So, what happens when you change the working tree copy, in terms of Git? The answer is: Nothing happens. The working tree copy isn't in Git. You change it and, well, it's changed. Git doesn't even know or care. It's not in Git. You changed it with something that isn't Git.
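
Here is a rough way to convince yourself of that. The sketch uses a scratch repository with an assumed "demo" identity: it edits the working tree file and then checks that nothing inside Git moved.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
echo "CHANGE #1" > test.txt
git add test.txt
git commit -q -m "first"

before=$(git rev-parse HEAD)

# Edit the working tree copy with something that isn't Git.
echo "CHANGE #2" >> test.txt

after=$(git rev-parse HEAD)                  # the commit is unchanged
committed=$(git cat-file -p HEAD:test.txt)   # the frozen copy is unchanged
worktree=$(cat test.txt)                     # only the ordinary file changed
```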

But now, you've asked Git to switch to some other branch:

git switch branch-1

or:

git switch master

Things now may get ... complicated.


[2] There are two formats for Git's internal objects. One is not very difficult to read, so with a simple zlib decompressor library and some simple programming, many programs could read these. The other format is much more compressed, though, and requires very specialized code to handle.


Branch names and commit hash IDs

I've already mentioned that the branch names are included in the "refs" in one of the two databases, and that commits have unique hash ID numbers. The hash IDs look random (they are not random at all but we'll ignore the details here), but the important part here is the "unique" thing. Each commit has a unique ID. This is how Git tells which commit is which.

Because the numbers are so big and ugly and random-looking (e.g., 63bba4fdd86d80ef061c449daa97a981a9be0792), humans are bad at them. We use the names instead. We say master or branch-1 or whatever. Git looks up the name in the refs database and gets the big ugly number, and that's the commit you said you'd like.

Sometimes, when you say:

git switch xyzzy

for some name xyzzy, you're telling Git: switch to a different commit hash ID while remembering the new name. But some branch names store the same big ugly hash ID, sometimes. When the number is the same, you're telling Git: switch to the same commit, but remember the new name.

That's the case when you have not made a new commit, but have made a new branch name, as you did here:

$ git branch branch-1    # while you were on "master"
...
$ git switch branch-1

Git will remember which name is the current branch name, and will use the refs database entry for master or branch-1 to look up the big ugly hash ID. Because both names currently select the same hash ID, you're not actually changing commits. (For the record, we can see above, in your question, that the abbreviated hash ID of this commit is 439c0f8. Git printed it out when you made the root commit.)
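
You can check that two names select the same commit with git rev-parse. The sketch below recreates the situation from the question in a scratch repository; the "demo" identity is an assumption, and git symbolic-ref is used only to make sure the first branch is called master regardless of local configuration.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master   # name the initial branch "master"
echo "CHANGE #1" > test.txt
git add test.txt
git commit -q -m "first"

git branch branch-1                       # new name, no new commit

master_id=$(git rev-parse master)
branch_id=$(git rev-parse branch-1)

# List every branch name with the (abbreviated) hash ID it stores.
git for-each-ref --format='%(refname:short) %(objectname:short)' refs/heads
```

Both lines of the final listing show the same abbreviated hash ID, which is why switching between these two names never has to touch a file.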

If you're not changing commits, Git never has to change any files. So it doesn't bother. This means you can easily switch branches, even if you have uncommitted work.

If you are changing commits, though, Git may have to replace some files in your working tree. This is when things do get complicated.

Git's index or staging area

I already mentioned the two obvious copies of each file that must exist:

  • the frozen committed copy of the files in the current commit, and
  • the usable, ordinary-file copy of the files that you're working on/with.

The first is in Git and the second isn't. But Git, for its own Gitty reasons, goes on to keep a secret third copy—or "copy"—of each file:

  • the third "copy" of each file is in Git's index or staging area [3].

These two terms, index and staging area, refer to the same thing; there's a third term, mostly obsolete now, cache, that you mostly see in flags like git rm --cached. They all refer to this place that stores this third copy, or "copy", of each file.

I keep putting this in quotes like this because the index version of a file is pre-de-duplicated. That is, if the index copy of some file is a duplicate of some existing file, it's already de-duplicated. When you first check out the first commit and fill in your working tree for the first time, that fills in Git's index for the first time too.

Since all the files that go into Git's index are, literally, duplicates—they're the exact versions of the files that are in the commit being checked out—they are all de-duplicated away and therefore take no space. But other than this, it's easiest to think of these as separate copies, and the reason for that is simple: The index copy of any file can be replaced at any time. Running git add tells Git to update the index copy: Git reads and compresses the working tree copy, de-duplicates it if it's a duplicate, and updates the index copy with the result.

The index copies of files are sort of "halfway into" Git. They become permanent the moment you run git commit, which tells Git: Make a new snapshot, using the pre-de-duplicated files already in the index.

Since the index already contains all the files from the current commit—unless, that is, you've removed or replaced them—the new commit contains exactly the same files as the current commit, except for the ones that you've replaced by git add-ing. So the new commit is a full snapshot of every file, with unchanged files not taking any extra space because they're de-duplicated. Note that this de-duplication takes no time either since the index copies are all pre-de-duplicated. It's actually all rather clever.
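
The index is hidden, but you can peek at it. In this sketch (scratch repository, assumed "demo" identity) the index entry for test.txt starts out identical to the committed copy and only changes when git add runs:

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
echo "CHANGE #1" > test.txt
git add test.txt
git commit -q -m "first"

echo "CHANGE #2" >> test.txt              # working tree copy changes

head_blob=$(git rev-parse HEAD:test.txt)  # hash ID of the committed copy
index_before=$(git rev-parse :test.txt)   # hash ID of the index copy

git add test.txt                          # replace the index copy

index_after=$(git rev-parse :test.txt)
worktree_blob=$(git hash-object test.txt) # what the working tree copy hashes to

git ls-files -s                           # mode, hash ID, stage, path
```

Before git add, the index still holds the committed version; afterwards it holds the version that the next git commit will freeze.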

Now, though, things get complicated when actually changing commits, because now Git has a fast way to detect which files really need changing.


[3] As noted in footnote 1, it's no longer really the index, as each added working tree gets its own separate index. So it's "this working tree's index". But there's a particular primary working tree, and that particular primary working tree gets the initial index that comes with every Git repository, even a bare one that has no working tree. This is just a historic oddity, at this point, but it has to be maintained for backwards compatibility.


Actually changing commits

Suppose we are now on commit 4c62bc9, the second one you made, which you made while you were "on" branch branch-1. You now run:

git switch master

which means "switch to branch master and commit 439c0f8". This is a different commit hash ID. Git can't completely short-cut the switch: it can't just store a new name and say "all done". Git has to take all the files out of its index and your working tree that go with commit 4c62bc9, your second commit, and instead fill in its index and your working tree with all the files from commit 439c0f8, your first commit.

But Git can still cheat! The index holds inside itself the hash IDs of each of the files from the current (4c62bc9, branch-1) commit, and Git can very quickly (through the unique hash ID trick) know which files in the to-be-switched-to commit 439c0f8 are identical. For each of those files, it can leave the index entry alone and leave the file itself alone too. And that's what Git does.

So, if you have changed some files and not committed, and those turn out to be files that Git must delete and maybe replace because they're not the same in the commit you're moving to, Git will stop and complain that you have uncommitted changes. But if you've changed other files and not committed, that might not stop you: those files are the same in the old and new commits, and don't have to be swapped out, so Git doesn't.
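
Both cases can be reproduced in a scratch repository. This is only a sketch: other.txt and the "demo" identity are made up for the example, and the second file exists just to have something that is identical in both commits.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master
echo "CHANGE #1" > test.txt
echo "notes" > other.txt
git add test.txt other.txt
git commit -q -m "first"

git switch -q -c branch-1
echo "CHANGE #2" >> test.txt
git commit -q -am "second"      # test.txt now differs between the two commits

# Case 1: uncommitted edit to a file that is the same in both commits.
echo "draft" >> other.txt
git switch -q master && carried=yes || carried=no
other_now=$(cat other.txt)      # the edit came along
git switch -q branch-1

# Case 2: uncommitted edit to a file that differs between the commits.
echo "CHANGE #3" >> test.txt
if git switch -q master 2>/dev/null; then refused=no; else refused=yes; fi
current=$(git rev-parse --abbrev-ref HEAD)

echo "carried=$carried refused=$refused current=$current"
```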

Useful reminders

If you have files that Git can carry across a branch-name-change (with or without a commit-hash-ID-change), Git will do that. This allows you to start work and then decide that, oops, this work was supposed to happen on a different branch. You don't have to save it now, switch branches, restore it, switch back, erase a commit, switch back again ... you can just switch and keep working.

As a reminder, though, Git prints that line:

M       test.txt

to make note that although Git switched from one branch name to another, there are uncommitted changes that Git didn't have to erase. It does this even for the full shortcut ("not changing any files at all because the commit hash ID is the same"). You can suppress the reminder (git switch -q), if you like.
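
For comparison, here is a sketch of the same switch with and without -q, again in a scratch repository with an assumed "demo" identity. Both output streams are captured, since the reminder and the "Switched to branch" message may go to different ones.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master
echo "CHANGE #1" > test.txt
git add test.txt
git commit -q -m "first"
git branch branch-1
echo "CHANGE #2" >> test.txt    # uncommitted change

reminder=$(git switch branch-1 2>&1)   # includes the "M  test.txt" reminder
quiet=$(git switch -q master 2>&1)     # prints nothing

printf 'normal: %s\nquiet:  %s\n' "$reminder" "$quiet"
```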

If you can't switch branches, because the file you started to change is different in the other branch's tip commit, that's when you need to save your work-so-far. There are multiple ways to do that, including the fancy git stash command. I personally recommend avoiding git stash: just make actual commits, perhaps on a new temporary branch, and then cherry-pick them. This gives you the full Git tools if something goes wrong (vs git stash, which can wind up making a messy merge that can't be backed out, leaving you with a no-fun day: this doesn't happen often, but once you've had it happen even once, you probably don't want to go through it again).
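
One possible shape for the commit-and-cherry-pick route is sketched below. The branch name wip, the file contents, and the "demo" identity are all assumptions for the example, and the edits are deliberately placed far apart in the file so that the cherry-pick applies cleanly; real work may well produce conflicts that you then resolve with the usual tools.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master
printf '%s\n' one two three four five six seven eight > test.txt
git add test.txt
git commit -q -m "base"

git switch -q -c branch-1
printf '%s\n' "one (edited)" two three four five six seven eight > test.txt
git commit -q -am "branch-1: change first line"

# Uncommitted work on the last line, started on the wrong branch.
printf '%s\n' "one (edited)" two three four five six seven "eight (edited)" > test.txt
git switch -q master 2>/dev/null || echo "switch refused, saving work first"

git switch -q -c wip                 # temporary branch at the same commit
git commit -q -am "wip: change last line"
git switch -q master                 # nothing uncommitted now, so this works
git cherry-pick wip >/dev/null       # copy just that one commit onto master
git branch -q -D wip                 # the temporary name is no longer needed

first_line=$(head -n 1 test.txt)
last_line=$(tail -n 1 test.txt)
```

After this, master has the last-line edit but not the first-line change that belongs to branch-1.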

Summary

That's pretty long, so here's a summary:

  • Only committed work is fully saved in Git.
  • Your working tree files are not in Git at all.
  • The (hidden) index copies of files matter a lot.

Use git status to see shadows that represent the useful part of what's going on in the index (see Plato's Cave), and how that compares to what's going on in your working tree.
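
The short form of git status makes the two comparisons easy to tell apart: the first column compares the index with the current commit, the second compares the working tree with the index. A sketch, with made-up file names and a "demo" identity:

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
echo "CHANGE #1" > test.txt
echo "notes" > other.txt
git add test.txt other.txt
git commit -q -m "first"

echo "staged edit" >> test.txt
git add test.txt                  # index copy now differs from the commit

echo "unstaged edit" >> other.txt # working tree copy differs from the index

status=$(git status --short)
printf '%s\n' "$status"
# "M  test.txt": first column set, change is in the index
# " M other.txt": second column set, change is only in the working tree
```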

There's a lot more, with some hints about that in this long answer, but those three bullet points, plus git status, are the big takeaways here.

CodePudding user response:

As long as the changes are not committed, if you decide to check out a different branch, Git will carry the changed (or untracked) files over to the new branch. That is, it won't touch those files in the working tree or the index. This is not a bug: it is intended to work that way, which is very convenient.

There's actually one check that Git runs before allowing the checkout, to make sure it won't lose your changes. If a modified file differs between HEAD and the commit you want to check out, Git rejects the checkout (in order not to lose said changes). This can be overridden by using -f with checkout, in which case your changes are lost.
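
A small sketch of the forced case follows, in a scratch repository with an assumed "demo" identity. Note that the uncommitted line really is gone afterwards.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master
echo "CHANGE #1" > test.txt
git add test.txt
git commit -q -m "first"

git checkout -q -b branch-1
echo "CHANGE #2" >> test.txt
git commit -q -am "second"

echo "CHANGE #3" >> test.txt      # uncommitted change to a file that differs

# A plain checkout is rejected here; -f throws the local change away instead.
if git checkout -q master 2>/dev/null; then plain=ok; else plain=rejected; fi
git checkout -q -f master

content=$(cat test.txt)           # only what master's commit contains
```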
