---
stage: Create
group: Gitaly
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
type: reference
---

# How Git object deduplication works in GitLab

When a GitLab user [forks a project](../user/project/repository/forking_workflow.md),
GitLab creates a new Project with an associated Git repository that is a
copy of the original project at the time of the fork. If a large project
gets forked often, this can lead to a quick increase in Git repository
storage disk use. To counteract this problem, we are adding Git object
deduplication for forks to GitLab. In this document, we describe how
GitLab implements Git object deduplication.

## Pool repositories
### Understanding Git alternates

At the Git level, we achieve deduplication by using [Git
alternates](https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objects).
Git alternates is a mechanism that lets a repository borrow objects from
another repository on the same machine.

If we want repository A to borrow from repository B, we first write a
path that resolves to `B.git/objects` in the special file
`A.git/objects/info/alternates`. This establishes the alternates link.
Next, we must perform a Git repack in A. After the repack, any objects
that are duplicated between A and B are deleted from A. Repository
A is now no longer self-contained, but it still has its own refs and
configuration. Objects in A that are not in B remain in A. For this
to work, it is critical that **no objects ever get deleted from
B** because A might need them.
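
The two steps above can be sketched in a few lines of Python. This is an
illustration only, not how Gitaly does it: the helper names
`link_alternates` and `repack_command` are hypothetical, and the
`git repack -a -d -l` invocation is one common way to drop borrowed
objects, assumed here rather than taken from GitLab's implementation.

```python
import os
from typing import List


def link_alternates(borrower_git_dir: str, lender_git_dir: str) -> str:
    """Make the borrower repository borrow objects from the lender.

    Writes the absolute path of the lender's objects directory into the
    borrower's objects/info/alternates file and returns that file's path.
    """
    lender_objects = os.path.abspath(os.path.join(lender_git_dir, "objects"))
    info_dir = os.path.join(borrower_git_dir, "objects", "info")
    os.makedirs(info_dir, exist_ok=True)
    alternates_path = os.path.join(info_dir, "alternates")
    with open(alternates_path, "w", encoding="utf-8") as f:
        f.write(lender_objects + "\n")
    return alternates_path


def repack_command(borrower_git_dir: str) -> List[str]:
    """Build (but do not run) a repack command for the borrower.

    -a repacks everything into one pack, -d removes packs made redundant,
    and -l leaves objects borrowed through alternates out of the new pack.
    """
    return ["git", "-C", borrower_git_dir, "repack", "-a", "-d", "-l"]
```

Calling `link_alternates("A.git", "B.git")` and then running the command
returned by `repack_command("A.git")` corresponds to the "write the path,
then repack" sequence described above.
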

WARNING:
Do not run `git prune` or `git gc` in pool repositories! This can
cause data loss in "real" repositories that depend on the pool in
question.

The danger lies in `git prune`, and `git gc` calls `git prune`. The
problem is that `git prune`, when running in a pool repository, cannot
reliably decide if an object is no longer needed.
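
A toy model may help show why. It assumes that a prune only considers
what is reachable from the refs of the repository it runs in, and has no
knowledge of other repositories that borrow from it. The object IDs and
ref lists below are made up for illustration.

```python
def reachable(refs, parents):
    """Return every object reachable from the given ref tips.

    `parents` maps an object ID to the object IDs it points at.
    """
    seen = set()
    stack = list(refs)
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(parents.get(oid, ()))
    return seen


# Objects stored in the pool, with the objects each one points at.
pool_objects = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
pool_refs = ["c2"]    # the pool's own refs do not point at c3
member_refs = ["c3"]  # but a member repository still has a branch at c3

# A prune in the pool would consider only the pool's own refs.
prunable = set(pool_objects) - reachable(pool_refs, pool_objects)
# The member borrows c3 from the pool, so it still needs that object.
still_needed = reachable(member_refs, pool_objects)
lost = prunable & still_needed
print(sorted(lost))  # ['c3']
```

In this sketch `c3` looks unreferenced from the pool's point of view, yet
deleting it would break the member repository.
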
### Git alternates in GitLab: pool repositories

GitLab organizes this object borrowing by creating special **pool
repositories** which are hidden from the user. We then use Git
alternates to let a collection of project repositories borrow from a
single pool repository. We call such a collection of project
repositories a pool. Pools form star-shaped networks of repositories
that borrow from a single pool, which resemble (but are not necessarily
identical to) the fork networks that get formed when users fork
projects.

At the Git level, pool repositories are created and managed using Gitaly
RPC calls. Just like with normal repositories, the authority on which
pool repositories exist, and which repositories borrow from them, lies
at the Rails application level in SQL.

In conclusion, we need three things for effective object deduplication
across a collection of GitLab project repositories at the Git level:

1. A pool repository must exist.
1. The participating project repositories must be linked to the pool
   repository via their respective `objects/info/alternates` files.
1. The pool repository must contain Git object data common to the
   participating project repositories.

### Deduplication factor

The effectiveness of Git object deduplication in GitLab depends on the
amount of overlap between the pool repository and each of its
participants. Each time garbage collection runs on the source project,
Git objects from the source project are migrated to the pool
repository. One by one, as garbage collection runs, other member
projects benefit from the new objects that got added to the pool.
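
To make "overlap" concrete, here is a small sketch that treats
repositories as sets of object IDs. The `shared_fraction` metric is a
hypothetical illustration, not a quantity GitLab computes, and the object
IDs are made up.

```python
def shared_fraction(member_objects, pool_objects):
    """Fraction of a member's objects that are also present in the pool."""
    member_objects = set(member_objects)
    if not member_objects:
        return 0.0
    return len(member_objects & set(pool_objects)) / len(member_objects)


pool = {"a", "b", "c"}
fork = {"a", "b", "c", "d"}  # the fork has also picked up object "d"

before = shared_fraction(fork, pool)  # 0.75

# Garbage collection on the source project migrates new objects to the pool.
pool |= {"d", "e"}

after = shared_fraction(fork, pool)  # 1.0
```

The higher this fraction, the fewer objects the member has to keep in its
own object directory once it has been repacked against the pool.
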
## SQL model

As of GitLab 11.8, project repositories in GitLab do not have their own
SQL table. They are indirectly identified by columns on the `projects`
table. In other words, the only way to look up a project repository is to
first look up its project, and then call `project.repository`.

With pool repositories we made a fresh start. These live in their own
`pool_repositories` SQL table. The relations between these two tables
are as follows:

- a `Project` belongs to at most one `PoolRepository`
  (`project.pool_repository`)
- as an automatic consequence of the above, a `PoolRepository` has
  many `Project`s
- a `PoolRepository` has exactly one "source `Project`"
  (`pool.source_project`)
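
The relations above can be mirrored in a minimal in-memory model. This is
a sketch for illustration, not the Rails models: the attribute
`member_projects` and the helper `join_pool` are hypothetical names, while
`pool_repository` and `source_project` follow the accessors named above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class PoolRepository:
    source_project: "Project"  # exactly one source project
    member_projects: List["Project"] = field(default_factory=list)


@dataclass(eq=False)
class Project:
    name: str
    pool_repository: Optional[PoolRepository] = None  # at most one pool


def join_pool(project: Project, pool: PoolRepository) -> None:
    """Record that a project borrows from a pool."""
    if project.pool_repository is not None:
        raise ValueError(f"{project.name} already belongs to a pool")
    project.pool_repository = pool
    pool.member_projects.append(project)


source = Project("source")
pool = PoolRepository(source_project=source)
join_pool(source, pool)
fork = Project("fork")
join_pool(fork, pool)
```

After this runs, `fork.pool_repository` is `pool`, and
`pool.member_projects` contains both `source` and `fork`.
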

> TODO Fix invalid SQL data for pools created prior to GitLab 11.11
> <https://gitlab.com/gitlab-org/gitaly/-/issues/1653>.

### Assumptions

- All repositories in a pool must use [hashed
  storage](../administration/repository_storage_types.md). This is so
  that we never have to worry about updating paths in
  `objects/info/alternates` files.
- All repositories in a pool must be on the same Gitaly storage shard.
  The Git alternates mechanism relies on direct disk access across
  multiple repositories, and we can only assume direct disk access to
  be possible within a Gitaly storage shard.
- The only two ways to remove a member project from a pool are (1) to
  delete the project or (2) to move the project to another Gitaly
  storage shard.

### Creating pools and pool memberships

- When a pool gets created, it must have a source project. The initial
  contents of the pool repository are a Git clone of the source
  project repository.
- The occasion for creating a pool is when an existing eligible
  (non-private, hashed storage, non-forked) GitLab project gets forked and
  this project does not belong to a pool repository yet. The fork
  parent project becomes the source project of the new pool, and both
  the fork parent and the fork child project become members of the new
  pool.
- Once project A has become the source project of a pool, all future
  eligible forks of A become pool members.
- If the fork source is itself a fork, the resulting repository
  neither joins a pool repository nor seeds a new pool repository.

  For example:

  Suppose fork A is part of a pool repository. Any forks created off
  of fork A *are not* a part of the pool repository that fork A is
  a part of.

  Suppose B is a fork of A, and A does not belong to an object pool.
  Now C gets created as a fork of B. C is not part of a pool
  repository.
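
The rules in this list can be condensed into a small decision function.
This is a simplified sketch: `ForkParent` and `pool_action_on_fork` are
hypothetical names, only the fork parent is inspected, and the eligibility
of the fork child itself is left out.

```python
from dataclasses import dataclass


@dataclass
class ForkParent:
    private: bool = False
    hashed_storage: bool = True
    is_fork: bool = False         # the parent was itself created by forking
    is_pool_source: bool = False  # the parent already seeded a pool


def pool_action_on_fork(parent: ForkParent) -> str:
    """Decide what happens to pool membership when `parent` gets forked."""
    if parent.is_fork:
        return "no_pool"      # forks of forks are not deduplicated
    if parent.is_pool_source:
        return "join_pool"    # the new fork joins the existing pool
    if not parent.private and parent.hashed_storage:
        return "create_pool"  # parent becomes the source; both become members
    return "no_pool"          # parent is not eligible to seed a pool
```

For example, `pool_action_on_fork(ForkParent())` returns `"create_pool"`,
while `pool_action_on_fork(ForkParent(is_fork=True))` returns `"no_pool"`.
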

> TODO should forks of forks be deduplicated?
> <https://gitlab.com/gitlab-org/gitaly/-/issues/1532>

### Consequences

- If a normal Project participating in a pool gets moved to another
  Gitaly storage shard, its "belongs to PoolRepository" relation is
  broken. Because of the way moving repositories between shards is
  implemented, we get a fresh self-contained copy
  of the project's repository on the new storage shard.
- If the source project of a pool gets moved to another Gitaly storage
  shard or is deleted, the "source project" relation is not broken.
  However, as of GitLab 12.0 a pool does not fetch from a source
  unless the source is on the same Gitaly shard.

## Consistency between the SQL pool relation and Gitaly
As far as Gitaly is concerned, the SQL pool relations make two types of
claims about the state of affairs on the Gitaly server: pool repository
existence, and the existence of an alternates connection between a
repository and a pool.
### Pool existence

If GitLab thinks a pool repository exists (that is, it exists according to
SQL), but it does not exist on the Gitaly server, then it is created on
the fly by Gitaly.

### Pool relation existence
There are three different things that can go wrong here.

#### 1. SQL says repository A belongs to pool P but Gitaly says A has no alternate objects

In this case, we miss out on disk space savings but all RPCs on A
itself function fine. The next time garbage collection runs on A,
the alternates connection gets established in Gitaly. This is done by
`Projects::GitDeduplicationService` in GitLab Rails.

#### 2. SQL says repository A belongs to pool P1 but Gitaly says A has alternate objects in pool P2

In this case, `Projects::GitDeduplicationService` throws an exception.

#### 3. SQL says repository A does not belong to any pool but Gitaly says A belongs to P

In this case, `Projects::GitDeduplicationService` tries to
"re-duplicate" the repository A using the `DisconnectGitAlternates` RPC.
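
The three cases can be summarized as a single decision function. This is
a hedged sketch of the logic described above, not the code of
`Projects::GitDeduplicationService`: the function name `reconcile`, its
parameters, and the returned strings are all hypothetical.

```python
from typing import Optional


def reconcile(sql_pool: Optional[str], gitaly_pool: Optional[str]) -> str:
    """Pick an action from what SQL and Gitaly each report for one repository.

    Each argument is the name of the pool the repository belongs to
    according to that system, or None if it reports no pool.
    """
    if sql_pool == gitaly_pool:
        return "nothing to do"
    if gitaly_pool is None:
        # Case 1: establish the alternates link on the next garbage collection.
        return "link repository to " + sql_pool
    if sql_pool is None:
        # Case 3: re-duplicate the repository by disconnecting its alternates.
        return "disconnect alternates"
    # Case 2: SQL and Gitaly name different pools.
    raise RuntimeError(f"SQL says {sql_pool}, Gitaly says {gitaly_pool}")
```

For example, `reconcile("P", None)` returns `"link repository to P"`, and
`reconcile("P1", "P2")` raises an exception.
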
## Git object deduplication and GitLab Geo

When a pool repository record is created in SQL on a Geo primary, this
eventually triggers an event on the Geo secondary. The Geo secondary
then creates the pool repository in Gitaly. This leads to an
"eventually consistent" situation because as each pool participant gets
synchronized, Geo eventually triggers garbage collection in Gitaly on
the secondary, at which stage Git objects are deduplicated.

> TODO How do we handle the edge case where at the time the Geo
> secondary tries to create the pool repository, the source project does
> not exist? <https://gitlab.com/gitlab-org/gitaly/-/issues/1533>