---
stage: Create
group: Source Code
info: "To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments"
type: reference, howto
---
# Partial clone **(FREE)**

As Git repositories grow in size, they can become cumbersome to work with
because of:

- The large amount of history that must be downloaded.
- The large amount of disk space they require.

[Partial clone](https://github.com/git/git/blob/master/Documentation/technical/partial-clone.txt)
is a performance optimization that "allows Git to function without having a
complete copy of the repository. The goal of this work is to allow Git better
handle extremely large repositories."

Git 2.22.0 or later is required.
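
To check the version requirement from a script, a comparison along the following lines could be used. This is a minimal sketch: `version_at_least` is a hypothetical helper name, and it assumes `sort -V` (version sort, as provided by GNU coreutils) is available.

```shell
# Return success if dotted version $1 is greater than or equal to $2.
# Assumes `sort -V` (version sort) is available.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare the installed Git against the 2.22.0 minimum.
if command -v git >/dev/null 2>&1; then
  installed=$(git --version | awk '{print $3}')
  if version_at_least "$installed" 2.22.0; then
    echo "Git $installed meets the minimum for partial clone"
  else
    echo "Git $installed is older than 2.22.0"
  fi
fi
```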
## Filter by file size

> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2553) in GitLab 12.10.

Storing large binary files in Git is normally discouraged, because every large
file added is downloaded by everyone who clones or fetches changes
thereafter. These downloads are slow and problematic, especially when working from a slow
or unreliable internet connection.

Using partial clone with a file size filter solves this problem, by excluding
troublesome large files from clones and fetches. When Git encounters a missing
file, it's downloaded on demand.

When cloning a repository, use the `--filter=blob:limit=<size>` argument. For example,
to clone the repository excluding files larger than 1 megabyte:
```shell
git clone --filter=blob:limit=1m git@gitlab.com:gitlab-com/www-gitlab-com.git
```
This would produce the following output:
```plaintext
Cloning into 'www-gitlab-com'...
remote: Enumerating objects: 832467, done.
remote: Counting objects: 100% (832467/832467), done.
remote: Compressing objects: 100% (207226/207226), done.
remote: Total 832467 (delta 585563), reused 826624 (delta 580099), pack-reused 0
Receiving objects: 100% (832467/832467), 2.34 GiB | 5.05 MiB/s, done.
Resolving deltas: 100% (585563/585563), done.
remote: Enumerating objects: 146, done.
remote: Counting objects: 100% (146/146), done.
remote: Compressing objects: 100% (138/138), done.
remote: Total 146 (delta 8), reused 144 (delta 8), pack-reused 0
Receiving objects: 100% (146/146), 471.45 MiB | 4.60 MiB/s, done.
Resolving deltas: 100% (8/8), done.
Updating files: 100% (13008/13008), done.
Filtering content: 100% (3/3), 131.24 MiB | 4.65 MiB/s, done.
```
The output is longer because Git:

1. Clones the repository excluding files larger than 1 megabyte.
1. Downloads any missing large files needed to check out the default branch.

When changing branches, Git may download more missing files.
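
To confirm which filter a clone was created with, you can read the `remote.<name>.promisor` and `remote.<name>.partialclonefilter` configuration values that Git records for a partial clone. The following is a sketch only; `partial_clone_filter` is a hypothetical helper name, and it assumes it is run from inside the cloned repository.

```shell
# Print the partial clone filter recorded for a remote (default: origin).
# Returns non-zero if the remote is not marked as a promisor remote,
# which suggests the repository is not a partial clone of that remote.
partial_clone_filter() {
  remote=${1:-origin}
  [ "$(git config --get "remote.$remote.promisor")" = "true" ] || return 1
  git config --get "remote.$remote.partialclonefilter"
}
```

For a clone made with the command above, running `partial_clone_filter` in the repository should print `blob:limit=1m`.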
## Filter by object type

> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2553) in GitLab 12.10.

For repositories with millions of files and a long history, you can exclude all files and use
[`git sparse-checkout`](https://git-scm.com/docs/git-sparse-checkout) to reduce the size of
your working copy.
```plaintext
# Clone the repo excluding all files
$ git clone --filter=blob:none --sparse git@gitlab.com:gitlab-com/www-gitlab-com.git
Cloning into 'www-gitlab-com'...
remote: Enumerating objects: 678296, done.
remote: Counting objects: 100% (678296/678296), done.
remote: Compressing objects: 100% (165915/165915), done.
remote: Total 678296 (delta 472342), reused 673292 (delta 467476), pack-reused 0
Receiving objects: 100% (678296/678296), 81.06 MiB | 5.74 MiB/s, done.
Resolving deltas: 100% (472342/472342), done.
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 28 (delta 0), reused 12 (delta 0), pack-reused 0
Receiving objects: 100% (28/28), 140.29 KiB | 341.00 KiB/s, done.
Updating files: 100% (28/28), done.
$ cd www-gitlab-com
$ git sparse-checkout init --cone
$ git sparse-checkout add data
remote: Enumerating objects: 301, done.
remote: Counting objects: 100% (301/301), done.
remote: Compressing objects: 100% (292/292), done.
remote: Total 301 (delta 16), reused 102 (delta 9), pack-reused 0
Receiving objects: 100% (301/301), 1.15 MiB | 608.00 KiB/s, done.
Resolving deltas: 100% (16/16), done.
Updating files: 100% (302/302), done.
```
For more details, see the Git documentation for
[`sparse-checkout`](https://git-scm.com/docs/git-sparse-checkout).
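
The example above uses `git sparse-checkout add`, which appends to the current set of directories. To replace the set instead, `git sparse-checkout set` can be used. The sketch below wraps this in a hypothetical helper, `narrow_worktree`, and assumes a Git version that includes the `sparse-checkout` command (Git 2.25 or later) and that it runs inside the repository.

```shell
# Restrict the working copy to the given directories (cone mode),
# replacing any previously selected directories, then print the result.
# Files in the repository root stay checked out in cone mode.
narrow_worktree() {
  git sparse-checkout init --cone &&
    git sparse-checkout set "$@" &&
    git sparse-checkout list
}
```

For example, `narrow_worktree data` keeps only the root files and the `data` directory in the working copy.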
## Filter by file path

Deeper integration between partial clone and sparse checkout is possible through the
`--filter=sparse:oid=<blob-ish>` filter spec. This mode of filtering uses a format similar to a
`.gitignore` file to specify which files to include when cloning and fetching.

WARNING:
Partial clone using `sparse` filters is still experimental. It might be slow and significantly increase
[Gitaly](../../administration/gitaly/index.md) resource utilization when cloning and fetching.
[Filter all blobs and use sparse-checkout](#filter-by-object-type) instead, because
[`git-sparse-checkout`](https://git-scm.com/docs/git-sparse-checkout) simplifies
this type of partial clone use and overcomes its limitations.

For more details, see the Git documentation for
[`rev-list-options`](https://git-scm.com/docs/git-rev-list#Documentation/git-rev-list.txt---filterltfilter-specgt).

1. Create a filter spec. For example, consider a monolithic repository with many applications,
   each in a different subdirectory in the root. Create a file `shiny-app/.gitfilterspec`:
   ```plaintext
   # Only the paths listed in the file will be downloaded when performing a
   # partial clone using `--filter=sparse:oid=master:shiny-app/.gitfilterspec`

   # Explicitly include filterspec needed to configure sparse checkout with
   # git config --local core.sparsecheckout true
   # git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
   shiny-app/.gitfilterspec

   # Shiny App
   shiny-app/

   # Dependencies
   shimmery-app/
   shared-component-a/
   shared-component-b/
   ```
1. Clone and filter by path. Support for `--filter=sparse:oid` using the
   clone command is not fully integrated with sparse checkout.

   ```shell
   # Clone the filtered set of objects using the filterspec stored on the
   # server. WARNING: this step may be very slow!
   git clone --sparse --filter=sparse:oid=master:shiny-app/.gitfilterspec <url>

   # Optional: observe there are missing objects that we have not fetched
   git rev-list --all --quiet --objects --missing=print | wc -l
   ```
WARNING:
Git integrations with `bash`, Zsh, and other shells, and editors that automatically
show Git status information, often run `git fetch`, which fetches the
entire repository. You might need to disable or reconfigure these integrations.
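
The comments in the example filter spec mention configuring sparse checkout from the same file. A minimal sketch of that step follows. `apply_filterspec` is a hypothetical helper name, and the sketch assumes non-cone (pattern-based) sparse checkout, a clean working copy, and that it runs inside the cloned repository. It uses `git read-tree -mu HEAD` to apply the patterns to the working copy.

```shell
# Enable sparse checkout and append the patterns from a filter spec blob,
# for example master:shiny-app/.gitfilterspec, then update the working copy.
apply_filterspec() {
  spec=$1
  sparse_file=$(git rev-parse --git-path info/sparse-checkout)
  mkdir -p "$(dirname "$sparse_file")"
  git config --local core.sparsecheckout true &&
    git show "$spec" >> "$sparse_file" &&
    git read-tree -mu HEAD
}
```

For example, `apply_filterspec master:shiny-app/.gitfilterspec` limits the working copy to the paths listed in that file.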
## Remove partial clone filtering

Git repositories with partial clone filtering can have the filtering removed. To
remove filtering:

1. Fetch everything that has been excluded by the filters, to make sure that the
   repository is complete. If `git sparse-checkout` was used, use
   `git sparse-checkout disable` to disable it. See the
   [`disable` documentation](https://git-scm.com/docs/git-sparse-checkout#Documentation/git-sparse-checkout.txt-emdisableem)
   for more information.

   Then do a regular `fetch` to ensure that the repository is complete. To check if
   there are missing objects to fetch, and then fetch them, especially when not using
   `git sparse-checkout`, the following commands can be used:
   ```shell
   # Show missing objects
   git rev-list --objects --all --missing=print | grep -e '^\?'

   # Show missing objects without a '?' character before them (needs GNU grep)
   git rev-list --objects --all --missing=print | grep -oP '^\?\K\w+'

   # Fetch missing objects
   git fetch origin $(git rev-list --objects --all --missing=print | grep -oP '^\?\K\w+')

   # Show number of missing objects
   git rev-list --objects --all --missing=print | grep -e '^\?' | wc -l
   ```
1. Repack everything. This can be done using `git repack -a -d`, for example. This
   should leave only three files in `.git/objects/pack/`:
   - A `pack-<SHA1>.pack` file.
   - Its corresponding `pack-<SHA1>.idx` file.
   - A `pack-<SHA1>.promisor` file.

1. Delete the `.promisor` file. The above step should have left only one
   `pack-<SHA1>.promisor` file, which should be empty and should be deleted.

1. Remove partial clone configuration. The partial clone-related configuration
   variables should be removed from Git configuration files. Usually only the following
   configuration must be removed:
   - `remote.origin.promisor`.
   - `remote.origin.partialclonefilter`.
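
The last step can be sketched as a small script. The helper names `count_missing_objects` and `remove_partial_clone_config` are hypothetical, and the sketch assumes it runs inside the repository after the fetch, repack, and `.promisor` cleanup steps. As a precaution, it refuses to remove the configuration while objects are still missing.

```shell
# Print the number of objects that are referenced but not present locally.
# `grep -c` exits non-zero when the count is 0, so that status is ignored.
count_missing_objects() {
  git rev-list --objects --all --missing=print | grep -c '^?' || true
}

# Remove the partial clone settings for a remote (default: origin),
# but only when no objects are missing.
remove_partial_clone_config() {
  remote=${1:-origin}
  missing=$(count_missing_objects)
  if [ "$missing" -ne 0 ]; then
    echo "Refusing to continue: $missing objects are still missing" >&2
    return 1
  fi
  # `git config --unset` fails if the key is not set; ignore that case.
  git config --unset "remote.$remote.promisor" || true
  git config --unset "remote.$remote.partialclonefilter" || true
}
```

For example, run `remove_partial_clone_config origin`, then confirm with `git config --get remote.origin.promisor`, which should print nothing.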