# Partial Clone for Large Repositories

CAUTION: **Alpha:**
Partial Clone is an experimental feature. It significantly increases
Gitaly resource utilization when performing a partial clone, and decreases
the performance of subsequent fetch operations.

As Git repositories grow very large, performance degrades and they become
harder to use. One major challenge is cloning the repository, because Git
downloads the entire repository, including every commit and every version of
every object. This can be slow to transfer and can require large amounts of
disk space.

Historically, performing a **shallow clone**
([`--depth`](https://www.git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt))
has been the only way to reduce the amount of data transferred when cloning
a Git repository. A shallow clone does not, however, allow filtering by
subtree, which is important for monolithic repositories containing many
projects, or by object size, which would prevent unnecessarily large objects
from being downloaded.

[Partial clone](https://github.com/git/git/blob/master/Documentation/technical/partial-clone.txt)
is a performance optimization that "allows Git to function without having a
complete copy of the repository. The goal of this work is to allow Git better
handle extremely large repositories."

Specifically, using partial clone, it should be possible for Git to natively
support:

- large objects, instead of using [Git LFS](https://git-lfs.github.com/)
- enormous repositories

Briefly, partial clone works by:

- excluding objects from being transferred when cloning or fetching a
  repository using a new `--filter` flag
- downloading missing objects on demand

Follow [Git for enormous repositories](https://gitlab.com/groups/gitlab-org/-/epics/773) for roadmap and updates.

## Enabling partial clone

GitLab 12.1 uses Git 2.21.0, which has an arbitrary file access security
vulnerability when `uploadpack.allowFilter` is enabled. Do not enable this
option in production environments.

A feature flag is planned to enable `uploadpack.allowFilter` and
`uploadpack.allowAnySHA1InWant` once the version of Git used by GitLab has been
updated to Git 2.22.0.

Follow [this issue](https://gitlab.com/gitlab-org/gitaly/issues/1553) for
updates.

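For reference, both options are ordinary Git configuration settings on the
repository that serves the clone. The sketch below sets them on a throwaway
bare repository, not on a GitLab instance. The temporary path and the name
`demo.git` are assumptions made for illustration, and the vulnerability
warning above still applies to any real server.

```bash
# Create a throwaway bare repository to act as the "server" (hypothetical path)
server="$(mktemp -d)/demo.git"
git init --quiet --bare "$server"

# Allow clients to request filtered (partial) clones and fetches
git -C "$server" config uploadpack.allowFilter true

# Allow clients to request arbitrary objects by ID, which is how
# missing objects are downloaded on demand
git -C "$server" config uploadpack.allowAnySHA1InWant true

# Confirm both settings
git -C "$server" config --get uploadpack.allowFilter
git -C "$server" config --get uploadpack.allowAnySHA1InWant
```
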
## Excluding objects by size

Partial Clone allows large objects to be stored directly in the Git repository,
and to be excluded from clones as desired by the user. This eliminates the
error-prone process of deciding which objects should be stored in LFS. With
partial clone, all files – large or small – can be treated the same.

With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
enabled on the Git server:

```bash
# clone the repo, excluding blobs larger than 1 megabyte
git clone --filter=blob:limit=1m <url>

# in the checkout step of the clone, and any subsequent operations
# any blobs that are needed will be downloaded on demand
git checkout feature-branch
```

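To see the effect of a size filter without a GitLab server, the behavior can
be reproduced locally. The following is a minimal sketch, assuming a Git
version with partial clone support. The repository layout, file names, and the
`1k` limit are made up for the demonstration. It builds a small source
repository, enables the two server options on it, performs a filtered clone
over `file://`, and counts the objects that were left out.

```bash
# Build a small throwaway source repository (all paths are hypothetical)
work="$(mktemp -d)"
git init --quiet "$work/source"
cd "$work/source"
git config user.email "demo@example.com"
git config user.name "Demo"
echo "small" > small.txt
head -c 2048 /dev/zero > large.bin   # 2 KiB blob, above the 1k limit used below
git add .
git commit --quiet -m "Add small and large files"

# Let this repository serve filtered clones
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true

# Partial clone without a checkout, so no blobs are fetched on demand yet
cd "$work"
git clone --quiet --no-checkout --filter=blob:limit=1k "file://$work/source" partial

# Missing objects are printed with a leading "?"; count them
missing="$(git -C partial rev-list --all --objects --missing=print | grep -c '^?')"
echo "missing objects: $missing"
```

Running `git checkout` in the `partial` directory afterwards should download
the excluded blob on demand, as described above.
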
## Excluding objects by path

Partial Clone allows clones to be filtered by path using a format similar to a
`.gitignore` file stored inside the repository.

With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
enabled on the Git server:

1. **Create a filter spec.** For example, consider a monolithic repository with
   many applications, each in a different subdirectory in the root. Create a file
   `shiny-app/.gitfilterspec` using the GitLab web interface:

   ```.gitignore
   # Only the paths listed in the file will be downloaded when performing a
   # partial clone using `--filter=sparse:oid=master:shiny-app/.gitfilterspec`

   # Explicitly include filterspec needed to configure sparse checkout with
   # git config --local core.sparsecheckout true
   # git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
   shiny-app/.gitfilterspec

   # Shiny App
   shiny-app/

   # Dependencies
   shimmery-app/
   shared-component-a/
   shared-component-b/
   ```

1. **Create a new Git repository and fetch.** Support for `--filter=sparse:oid`
   using the clone command is incomplete, so we will emulate the clone command
   by hand, using `git init` and `git fetch`. Follow
   [gitaly#1769](https://gitlab.com/gitlab-org/gitaly/issues/1769) for updates.

   ```bash
   # Create a new directory for the Git repository
   mkdir jumbo-repo && cd jumbo-repo

   # Initialize a new Git repository
   git init

   # Add the remote
   git remote add origin <url>

   # Enable partial clone support for the remote
   git config --local extensions.partialClone origin

   # Fetch the filtered set of objects using the filterspec stored on the
   # server. WARNING: this step is slow!
   git fetch --filter=sparse:oid=master:shiny-app/.gitfilterspec origin

   # Optional: observe there are missing objects that we have not fetched
   git rev-list --all --quiet --objects --missing=print | wc -l
   ```

   CAUTION: **IDE and Shell integrations:**
   Git integrations with `bash`, `zsh`, and other shells, and editors that
   automatically show Git status information, often run `git fetch`, which
   will fetch the entire repository. You may need to disable or reconfigure
   these integrations.

1. **Sparse checkout** must be enabled and configured to prevent objects from
   other paths being downloaded automatically when checking out branches. Follow
   [gitaly#1765](https://gitlab.com/gitlab-org/gitaly/issues/1765) for updates.

   ```bash
   # Enable sparse checkout
   git config --local core.sparsecheckout true

   # Configure sparse checkout
   git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout

   # Check out master
   git checkout master
   ```