==Design==
Git's design is a synthesis of Torvalds's experience with Linux in maintaining a large distributed development project, along with his intimate knowledge of file-system performance gained from the same project and the urgent need to produce a working system in short order. These influences led to the following implementation choices:
; Compatibility with existing systems and protocols: Repositories can be published via Hypertext Transfer Protocol Secure (HTTPS), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), or a Git protocol over either a plain socket or Secure Shell (ssh). Git also has a CVS server emulation, which enables the use of existing CVS clients and IDE plugins to access Git repositories. Subversion repositories can be used directly with git-svn.
; Efficient handling of large projects: Torvalds has described Git as being very fast and scalable, and performance tests done by Mozilla showed that it was an
order of magnitude faster at diffing large repositories than Mercurial and GNU Bazaar; fetching version history from a locally stored repository can be one hundred times faster than fetching it from the remote server.
; Cryptographic authentication of history: The Git history is stored in such a way that the ID of a particular version (a commit in Git terms) depends upon the complete development history leading up to that commit. Once it is published, it is not possible to change the old versions without it being noticed. The structure is similar to a Merkle tree, but with added data at the nodes and leaves. (Mercurial and Monotone also have this property.)
; Toolkit-based design: Git was designed as a set of programs written in
C and several shell scripts that provide wrappers around those programs. Although most of those scripts have since been rewritten in C for speed and portability, the design remains, and it is easy to chain the components together.
; Pluggable merge strategies: As part of its toolkit design, Git has a well-defined model of an incomplete merge, and it has multiple algorithms for completing it, culminating in telling the user that it is unable to complete the merge automatically and that manual editing is needed.
; Garbage accumulates until collected: Aborting operations or backing out changes will leave useless dangling objects in the database. These are generally a small fraction of the continuously growing history of wanted objects. Git will automatically perform garbage collection when enough loose objects have been created in the repository. Garbage collection can be called explicitly using git gc.
; Periodic explicit object packing: Git stores each newly created object as a separate file. Although each object is individually compressed, this takes up a great deal of space and is inefficient. This is solved by the use of packs that store a large number of objects delta-compressed among themselves in one file (or network byte stream) called a packfile. Packs are compressed using the heuristic that files with the same name are probably similar, without depending on this for correctness. A corresponding index file is created for each packfile, recording the offset of each object in the packfile. Newly created objects (with newly added history) are still stored as single objects, and periodic repacking is needed to maintain space efficiency. The process of packing the repository can be very computationally costly. By allowing objects to exist in the repository in a loose but quickly generated format, Git allows the costly pack operation to be deferred until later, when time matters less, e.g., the end of a workday. Git does periodic repacking automatically, but manual repacking is also possible with the git gc command. For data integrity, both the packfile and its index have an SHA-1 checksum inside, and the file name of the packfile also contains an SHA-1 checksum. To check the integrity of a repository, run the git fsck command.
Another property of Git is that it snapshots directory trees of files. The earliest systems for tracking versions of source code, Source Code Control System (SCCS) and Revision Control System (RCS), worked on individual files and emphasized the space savings to be gained from interleaved deltas (SCCS) or delta encoding (RCS) of the (mostly similar) versions. Later revision-control systems maintained this notion of a file having an identity across multiple revisions of a project. However, Torvalds rejected this concept. Consequently, Git does not explicitly record file revision relationships at any level below the source-code tree.
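The cryptographic authentication of history described above can be illustrated with a minimal Python sketch. It assumes the SHA-1 object naming scheme (a hash over a short header plus the payload) and uses a deliberately simplified commit body; `object_id` and `toy_commit_id` are hypothetical helper names, and a blob ID stands in for a real tree ID.

```python
import hashlib
from typing import Optional


def object_id(kind: str, payload: bytes) -> str:
    """Name an object by SHA-1 over the header '<kind> <size>' + NUL + payload."""
    header = f"{kind} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def toy_commit_id(snapshot_id: str, parent_id: Optional[str], message: str) -> str:
    """Toy commit: real commits also record author, committer, and timestamps."""
    lines = [f"tree {snapshot_id}"]
    if parent_id is not None:
        # The parent's ID is part of the bytes that get hashed.
        lines.append(f"parent {parent_id}")
    body = "\n".join(lines) + "\n\n" + message + "\n"
    return object_id("commit", body.encode())


# A two-commit history (assumption: a blob ID stands in for the tree ID).
first = toy_commit_id(object_id("blob", b"v1\n"), None, "first")
second = toy_commit_id(object_id("blob", b"v2\n"), first, "second")

# Alter the first version, then rebuild the second commit on top of it.
forged_first = toy_commit_id(object_id("blob", b"v1 (altered)\n"), None, "first")
forged_second = toy_commit_id(object_id("blob", b"v2\n"), forged_first, "second")

# The newest ID differs even though the newest content is unchanged.
print(second != forged_second)
```

Because each commit ID covers its parent's ID, anyone holding the original `second` can tell that the history behind `forged_second` is different.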
==Downsides==
These implicit revision relationships have some significant consequences:
• It is slightly more costly to examine the change history of one file than that of the whole project. To obtain a history of changes affecting a given file, Git must walk the global history and then determine whether each change modified that file. This method of examining history does, however, let Git produce with equal efficiency a single history showing the changes to an arbitrary set of files. For example, a subdirectory of the source tree plus an associated global header file is a very common case.
• Renames are handled implicitly rather than explicitly. A common complaint with CVS is that it uses the name of a file to identify its revision history, so moving or renaming a file is not possible without either interrupting its history or renaming the history and thereby making the history inaccurate. Most post-CVS revision-control systems solve this by giving a file a unique long-lived name (analogous to an inode number) that survives renaming. Git does not record such an identifier, and this is claimed as an advantage. Source code files are sometimes split or merged, or simply renamed, and recording this as a simple rename would freeze an inaccurate description of what happened in the (immutable) history. Git addresses the issue by detecting renames while browsing the history of snapshots rather than recording them when making the snapshot. (Briefly, given a file in revision N, a file of the same name in revision N − 1 is its default ancestor. However, when there is no like-named file in revision N − 1, Git searches for a file that existed only in revision N − 1 and is very similar to the new file.) This detection does, however, require more CPU-intensive work every time the history is reviewed, and several options to adjust the heuristics are available. The mechanism does not always work; sometimes a file that is renamed with changes in the same commit is read as a deletion of the old file and the creation of a new file. Developers can work around this limitation by committing the rename and the changes separately.
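The rename heuristic described above (pair a file that disappeared with a sufficiently similar file that appeared) can be sketched in Python. This is an illustration only, not Git's actual algorithm: it uses `difflib.SequenceMatcher` as a stand-in similarity measure, and `detect_renames` and its `threshold` parameter are hypothetical names chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; Git uses its own measure."""
    return SequenceMatcher(None, a, b).ratio()


def detect_renames(old: dict, new: dict, threshold: float = 0.5) -> dict:
    """Map each added path to the removed path it most resembles, if any.

    `old` and `new` are snapshots: dicts of path -> file content.
    """
    removed = {p: c for p, c in old.items() if p not in new}
    added = {p: c for p, c in new.items() if p not in old}
    renames = {}
    for new_path, new_content in added.items():
        best_path, best_score = None, threshold
        for old_path, old_content in removed.items():
            score = similarity(old_content, new_content)
            if score >= best_score:
                best_path, best_score = old_path, score
        if best_path is not None:
            renames[new_path] = best_path
            del removed[best_path]  # each removed file is paired at most once
    return renames


code = "def add(a, b):\n    return a + b\n"
rev_prev = {"util.py": code, "README": "docs\n"}
rev_next = {"helpers.py": code, "README": "docs\n"}
renamed = detect_renames(rev_prev, rev_next)

# Contents too different: treated as a deletion plus a creation.
unrelated = detect_renames({"a.txt": "aaaa"}, {"b.txt": "zzzz"})
```

Here `renamed` pairs `helpers.py` with `util.py`, while `unrelated` is empty. The second case mirrors the limitation noted above: a rename combined with heavy edits can fall below the threshold.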
==Merging strategies==
Git implements several merging strategies; a non-default strategy can be selected at merge time:
• resolve: the traditional three-way merge algorithm.
• recursive: This is the default when pulling or merging one branch, and is a variant of the three-way merge algorithm.
• octopus: This is the default when merging more than two heads.
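The three-way merge idea behind these strategies can be sketched in Python. The sketch below works at whole-file granularity to stay short, whereas Git's strategies also merge changes within a file; `three_way_merge` is a hypothetical helper, and each snapshot is assumed to be a dict of path to content.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two snapshots against their common ancestor, one file at a time.

    Returns (merged snapshot, list of conflicting paths).
    A missing path is represented by None, so deletions are handled too.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o            # both sides agree (or both deleted the file)
        elif o == b:
            result = t            # only their side changed the file
        elif t == b:
            result = o            # only our side changed the file
        else:
            conflicts.append(path)  # both changed it differently: needs manual editing
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
ours = {"a.txt": "2", "b.txt": "1", "c.txt": "2"}
theirs = {"a.txt": "1", "b.txt": "3", "c.txt": "3"}
merged, conflicts = three_way_merge(base, ours, theirs)
```

In this example `a.txt` takes our change, `b.txt` takes theirs, and `c.txt` is reported as a conflict because both sides changed it in different ways.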
==Data structures==
Git's primitives are not inherently a source-code management system. Torvalds explains:
From this initial design approach, Git has developed the full set of features expected of a traditional SCM, with features mostly being created as needed, then refined and extended over time.
Git has two data structures: a mutable index (also called stage or cache) that caches information about the working directory and the next revision to be committed; and an object database that stores immutable objects. The index serves as a connection point between the object database and the working tree.
Git also stores named references (refs) that point to commits:
• Heads (branches): Named references that are advanced automatically to the new commit when a commit is made on top of them.
• HEAD: A reserved head that will be compared against the working tree to create a commit.
• Tags: Like branch references, but fixed to a particular commit. Used to label important points in history.
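How the mutable index, the immutable object database, and the named references fit together can be sketched as a toy model in Python. Everything here is an assumption made for illustration: `ToyRepo` is a hypothetical class, the branch name `refs/heads/main` is just an example, and the text formats used for the tree and commit bodies are simplified stand-ins for Git's real encodings.

```python
import hashlib


class ToyRepo:
    def __init__(self):
        self.objects = {}              # object ID -> bytes; entries are never modified
        self.index = {}                # path -> blob ID staged for the next commit
        self.refs = {}                 # reference name -> commit ID
        self.head = "refs/heads/main"  # HEAD names the current branch

    def _store(self, kind: str, payload: bytes) -> str:
        raw = f"{kind} {len(payload)}".encode() + b"\x00" + payload
        oid = hashlib.sha1(raw).hexdigest()
        self.objects.setdefault(oid, raw)  # identical content is stored only once
        return oid

    def add(self, path: str, content: bytes) -> None:
        """Stage a file: write its blob and record it in the index."""
        self.index[path] = self._store("blob", content)

    def commit(self, message: str) -> str:
        """Snapshot the index and advance the current branch head."""
        listing = "".join(f"{oid} {path}\n" for path, oid in sorted(self.index.items()))
        tree = self._store("tree", listing.encode())  # toy text listing
        parent = self.refs.get(self.head)
        lines = [f"tree {tree}"] + ([f"parent {parent}"] if parent else [])
        body = "\n".join(lines) + "\n\n" + message + "\n"
        commit_id = self._store("commit", body.encode())
        self.refs[self.head] = commit_id  # the head moves to the new commit
        return commit_id

    def tag(self, name: str, commit_id: str) -> None:
        """Tags stay fixed to the commit they were given."""
        self.refs[f"refs/tags/{name}"] = commit_id


repo = ToyRepo()
repo.add("a.txt", b"one\n")
c1 = repo.commit("first")
repo.tag("v1", c1)
repo.add("a.txt", b"two\n")
c2 = repo.commit("second")
```

After the second commit the branch head points at `c2` while the tag `v1` still points at `c1`, and the object store holds two blobs, two trees, and two commits.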
==Commands==
Frequently used commands for Git's command-line interface include:
• git init, which is used to create a Git repository.
• git clone [URL], which clones, or duplicates, a Git repository from an external URL.
• git add [file], which adds a file to Git's index, or staging area (the set of changes about to be committed).
• git commit -m [commit message], which commits the staged changes from the index (so they are now part of the repository's history).
A .gitignore file may be created in a Git repository as a plain text file. Files matching the patterns listed in the .gitignore file will not be tracked by Git. This feature can be used to ignore files with keys or passwords, various extraneous files, and large files (which GitHub will refuse to upload).
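The effect of a .gitignore file can be approximated with a short Python sketch. It is a deliberate simplification that covers only plain glob patterns and trailing-slash directory patterns; negation, anchored paths, and `**` are not handled, and `is_ignored` is a hypothetical helper rather than anything Git provides.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path: str, patterns) -> bool:
    """Rough approximation of .gitignore matching for slash-free patterns."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        pattern = pattern.strip()
        if not pattern or pattern.startswith("#"):
            continue  # skip blank lines and comments
        if pattern.endswith("/"):
            # Directory pattern: ignore anything inside a directory of that name.
            if any(fnmatchcase(part, pattern[:-1]) for part in parts[:-1]):
                return True
        elif any(fnmatchcase(part, pattern) for part in parts):
            # Plain pattern: matches a file or directory name at any depth.
            return True
    return False


patterns = ["# local secrets", "secrets.env", "*.log", "build/"]
checks = {
    "debug.log": is_ignored("debug.log", patterns),
    "config/secrets.env": is_ignored("config/secrets.env", patterns),
    "build/out/app.bin": is_ignored("build/out/app.bin", patterns),
    "src/app.py": is_ignored("src/app.py", patterns),
}
```

With these example patterns the first three paths are ignored and `src/app.py` is not.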
==Git references==
Every object in the Git database that is not referred to may be cleaned up, either automatically or by running a garbage collection command. An object may be referenced by another object or by an explicit reference. Git has different types of references, and the commands to create, move, and delete them vary. git show-ref lists all references. Some types are:
• heads: refers to an object locally,
• remotes: refers to an object which exists in a remote repository,
• stash: refers to an object not yet committed,
• meta: e.g., a configuration in a bare repository or user rights; the refs/meta/config namespace was introduced retrospectively and is used by Gerrit,
• tags: see above.
==Implementations==