# Repository storage types **(CORE ONLY)**

> - [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/28283) in GitLab 10.0.
> - Hashed storage became the default for new installations in GitLab 12.0.
> - Hashed storage is enabled by default for new and renamed projects in GitLab 13.0.

GitLab can be configured to use one or more repository storage paths (shard
locations), each of which can be:

- Mounted to the local disk.
- Exposed as an NFS shared volume.
- Accessed via [Gitaly](gitaly/index.md) on its own machine.

In GitLab, this is configured in `/etc/gitlab/gitlab.rb` by the `git_data_dirs({})`
configuration hash. The storage layouts discussed here apply to any shard
defined in it.

The `default` repository shard, which is available in any installation
that hasn't customized it, points to the local folder `/var/opt/gitlab/git-data`.
Anything discussed below is expected to be part of that folder.

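
As a rough illustration of how a shard definition maps to a location on disk, the sketch below resolves a repository's absolute path from a shard name and a relative path. The `SHARDS` hash and the `repository_location` helper are hypothetical names for this example only. The hash mirrors the shape usually passed to `git_data_dirs`, and the `repositories` subdirectory is taken from the default Omnibus path quoted later in this document.

```ruby
# Hypothetical shard map, shaped like the hash passed to `git_data_dirs`.
SHARDS = {
  "default" => { "path" => "/var/opt/gitlab/git-data" }
}.freeze

# Returns the absolute location of a repository, given the shard name and
# the repository's path relative to that shard.
def repository_location(shard_name, relative_path, shards: SHARDS)
  shard = shards.fetch(shard_name) { raise KeyError, "unknown shard: #{shard_name}" }
  File.join(shard.fetch("path"), "repositories", relative_path)
end

puts repository_location("default", "group/project.git")
```

The same lookup applies to any additional shard added to the hash, which is why the layouts described below are independent of the shard a repository lives on.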
## Hashed storage

NOTE: **Note:**
In GitLab 13.0, hashed storage is enabled by default and the legacy storage is
deprecated. Support for legacy storage will be removed in GitLab 14.0.
If you haven't migrated yet, check the
[migration instructions](raketasks/storage.md#migrate-to-hashed-storage).
The option to choose between hashed and legacy storage in the admin area has
been disabled.

Hashed storage is the storage behavior we rolled out with 10.0. Instead
of coupling the project URL to the folder structure where the repository is
stored on disk, we couple the folder structure to a hash based on the project's
ID. This makes the folder structure immutable, and therefore eliminates any
requirement to synchronize state from URLs to disk structure. This means that
renaming a group, user, or project costs only the database transaction, and
takes effect immediately.

The hash also helps to spread the repositories more evenly on the disk, so the
top-level directory contains fewer folders than the total number of top-level
namespaces.

The hash format is based on the hexadecimal representation of SHA256:
`SHA256(project.id)`. The top-level folder uses the first 2 characters, followed
by another folder with the next 2 characters. They are both stored in a special
`@hashed` folder, to be able to co-exist with existing Legacy Storage projects:

```ruby
# Project's repository:
"@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git"

# Wiki's repository:
"@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}.wiki.git"
```

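The path templates above can be sketched as a small standalone helper. This is a minimal illustration, not GitLab's own implementation: `hashed_repository_path` is a hypothetical name, and it assumes the project ID is hashed as its decimal string representation.

```ruby
require "digest"

# Hypothetical helper: builds the hashed storage path for a project ID.
# Assumes the ID is hashed as its decimal string (for example "16").
def hashed_repository_path(project_id, wiki: false)
  hash = Digest::SHA256.hexdigest(project_id.to_s)
  suffix = wiki ? ".wiki.git" : ".git"
  # First two characters, then the next two, then the full hash.
  "@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}#{suffix}"
end

puts hashed_repository_path(16)
puts hashed_repository_path(16, wiki: true)
```

Because only the ID feeds the hash, the result is the same no matter what the project, group, or user is currently called.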
### Translating hashed storage paths

Troubleshooting problems with Git repositories, adding hooks, and other
tasks require you to translate between the human-readable project name
and the hashed storage path.

#### From project name to hashed path

The hashed path is shown on the project's page in the [admin area](../user/admin_area/index.md#administering-projects).

To access the Projects page, go to **Admin Area > Overview > Projects** and then
open up the page for the project.

The "Gitaly relative path" is shown there, for example:

```plaintext
"@hashed/b1/7e/b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9.git"
```

This is the path under `/var/opt/gitlab/git-data/repositories/` on a
default Omnibus installation.

In a [Rails console](troubleshooting/debug.md#starting-a-rails-console-session),
get this information using either the numeric project ID or the full path:

```ruby
Project.find(16).disk_path
Project.find_by_full_path('group/project').disk_path
```

#### From hashed path to project name

To translate from a hashed storage path to a project name:

1. Start a [Rails console](troubleshooting/debug.md#starting-a-rails-console-session).
1. Run the following:

```ruby
ProjectRepository.find_by(disk_path: '@hashed/b1/7e/b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9').project
```

The quoted string in that command is the directory tree you'll find on your
GitLab server, with the trailing `.git` removed from the directory name. For
example, on a default Omnibus installation the full directory is
`/var/opt/gitlab/git-data/repositories/@hashed/b1/7e/b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9.git`.

The output includes the project ID and the project name:

```plaintext
=> #<Project id:16 it/supportteam/ticketsystem>
```

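When starting from a directory found on the server, the value to pass as `disk_path` can be derived with plain string handling. The following is a sketch only: `disk_path_from_filesystem` and `REPOSITORIES_ROOT` are hypothetical names, and the root assumes the default Omnibus location mentioned above.

```ruby
# Assumed repositories root of a default Omnibus installation.
REPOSITORIES_ROOT = "/var/opt/gitlab/git-data/repositories"

# Hypothetical helper: turns an absolute repository directory into the
# relative path (without `.git`) used in the `disk_path` lookup above.
def disk_path_from_filesystem(path, root: REPOSITORIES_ROOT)
  unless path.start_with?("#{root}/")
    raise ArgumentError, "#{path} is not under #{root}"
  end

  path.delete_prefix("#{root}/").delete_suffix(".git")
end

puts disk_path_from_filesystem(
  "/var/opt/gitlab/git-data/repositories/@hashed/b1/7e/" \
  "b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9.git"
)
```

The resulting string can then be pasted into the `ProjectRepository.find_by` call in a Rails console.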
### Hashed object pools

> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/1606) in GitLab 12.1.

DANGER: **Danger:**
Do not run `git prune` or `git gc` in pool repositories! This can
cause data loss in "real" repositories that depend on the pool in
question.

Forks of public projects are deduplicated by creating a third repository, the
object pool, containing the objects from the source project. Using
`objects/info/alternates`, the source project and forks use the object pool for
shared objects. Objects are moved from the source project to the object pool
when housekeeping is run on the source project.

```ruby
# object pool paths
"@pools/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git"
```

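To check whether a given repository borrows objects from a pool, you can read its `objects/info/alternates` file. The sketch below is read-only and assumes direct filesystem access to the repository. `alternate_object_dirs` is a hypothetical helper name, and relative entries are resolved against the repository's `objects` directory, which is how Git interprets them.

```ruby
# Hypothetical helper: lists the alternate object directories referenced
# by a repository through `objects/info/alternates`. It only reads files.
def alternate_object_dirs(repo_path)
  objects_dir = File.join(repo_path, "objects")
  alternates_file = File.join(objects_dir, "info", "alternates")
  return [] unless File.exist?(alternates_file)

  File.readlines(alternates_file, chomp: true)
      .map(&:strip)
      .reject(&:empty?)
      .map { |entry| File.expand_path(entry, objects_dir) } # relative to objects/
end
```

An empty result means the repository is not linked to any pool. A path under `@pools` means it shares objects with the pool at that location.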
### Hashed storage coverage migration

Files stored in an S3-compatible endpoint do not have the downsides of legacy
storage, provided they are not prefixed with `#{namespace}/#{project_name}`.
This is the case for CI Cache and LFS Objects.

In the table below, you can find the coverage of the migration to the hashed storage.

| Storable Object  | Legacy storage | Hashed storage | S3 Compatible | GitLab Version |
| ---------------- | -------------- | -------------- | ------------- | -------------- |
| Repository       | Yes            | Yes            | -             | 10.0           |
| Attachments      | Yes            | Yes            | -             | 10.2           |
| Avatars          | Yes            | No             | -             | -              |
| Pages            | Yes            | No             | -             | -              |
| Docker Registry  | Yes            | No             | -             | -              |
| CI Build Logs    | No             | No             | -             | -              |
| CI Artifacts     | No             | No             | Yes           | 9.4 / 10.6     |
| CI Cache         | No             | No             | Yes           | -              |
| LFS Objects      | Yes            | Similar        | Yes           | 10.0 / 10.7    |
| Repository pools | No             | Yes            | -             | 11.6           |

#### Avatars

Each file is stored in a folder named after its `id` in the database. The filename is always `avatar.png` for user avatars.
When an avatar is replaced, the `Upload` model is destroyed and a new one takes its place with a different `id`.

#### CI artifacts

CI Artifacts have been S3 compatible since **9.4** (GitLab Premium), and this has been available in GitLab Core since **10.6**.

#### LFS objects

[LFS Objects in GitLab](../topics/git/lfs/index.md) implement a similar
storage pattern using two-character, two-level folders, following Git's own implementation:

```ruby
"shared/lfs-objects/#{oid[0..1]}/#{oid[2..3]}/#{oid[4..-1]}"

# Based on object `oid`: `8909029eb962194cfb326259411b22ae3f4a814b5be4f80651735aeef9f3229c`, path will be:
"shared/lfs-objects/89/09/029eb962194cfb326259411b22ae3f4a814b5be4f80651735aeef9f3229c"
```

LFS objects are also [S3 compatible](lfs/index.md#storing-lfs-objects-in-remote-object-storage).

## Legacy storage

NOTE: **Deprecated:**
In GitLab 13.0, hashed storage is enabled by default and the legacy storage is
deprecated. If you haven't migrated yet, check the
[migration instructions](raketasks/storage.md#migrate-to-hashed-storage).
Support for legacy storage will be removed in GitLab 14.0. If you're on GitLab
13.0 and later, switching new projects to legacy storage is not possible.
The option to choose between hashed and legacy storage in the admin area has
been disabled.

Legacy storage is the storage behavior prior to version 10.0. For historical
reasons, GitLab replicated the same mapping structure from the project URLs:

- Project's repository: `#{namespace}/#{project_name}.git`
- Project's wiki: `#{namespace}/#{project_name}.wiki.git`

This structure made it simple to migrate from existing solutions to GitLab and
easy for administrators to find where the repository is stored.

On the other hand, this has some drawbacks:

- The storage location concentrates a huge number of top-level namespaces. The
  impact can be reduced by the introduction of
  [multiple storage paths](repository_storage_paths.md).
- Because backups are a snapshot of the same URL mapping, if you try to recover a
  very old backup, you need to verify whether any project has taken the place of
  an old removed or renamed project sharing the same URL. This means that
  `mygroup/myproject` from your backup may not be the same original project that
  is at that same URL today.
- Any change in the URL needs to be reflected on disk (when groups, users, or
  projects are renamed). This can add a lot of load in big installations,
  especially if using any type of network-based filesystem.

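
The rename drawback can be made concrete by comparing the two layouts side by side. This is an illustrative sketch only: `legacy_disk_path` and `hashed_disk_path` are hypothetical helpers that follow the path templates in this document, and the hashed variant assumes the project ID is hashed as its decimal string.

```ruby
require "digest"

# Legacy layout: the path mirrors the project URL.
def legacy_disk_path(namespace, project_name)
  "#{namespace}/#{project_name}"
end

# Hashed layout: the path depends only on the project ID (assumed to be
# hashed as its decimal string).
def hashed_disk_path(project_id)
  hash = Digest::SHA256.hexdigest(project_id.to_s)
  "@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}"
end

project = { id: 16, namespace: "mygroup", name: "myproject" }
renamed = project.merge(name: "renamed-project")

# Legacy: the rename changes the path, so data has to move on disk.
puts legacy_disk_path(project[:namespace], project[:name])
puts legacy_disk_path(renamed[:namespace], renamed[:name])

# Hashed: the ID is unchanged, so the path stays the same.
puts hashed_disk_path(project[:id]) == hashed_disk_path(renamed[:id])
```

With the legacy layout every rename implies a move on disk, while with hashed storage only the database record changes.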