---
stage: Verify
group: Runner
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
type: reference
---
# Optimize GitLab for large repositories **(FREE)**

Large repositories consisting of more than 50k files in a worktree
may require more optimizations beyond
[pipeline efficiency](../pipelines/pipeline_efficiency.md)
because of the time required to clone and check out.

GitLab and GitLab Runner handle this scenario well
but require optimized configuration to efficiently perform their
set of operations.

The general guidelines for handling big repositories are simple.
Each guideline is described in more detail in the sections below:

- Always fetch incrementally. Do not clone in a way that results in recreating all of the worktree.
- Always use shallow clone to reduce data transfer. Be aware that this puts more burden
  on the GitLab instance due to higher CPU impact.
- Control the clone directory if you heavily use a fork-based workflow.
- Optimize `git clean` flags to ensure that you remove or keep data that might affect or speed up your build.

## Shallow cloning

> Introduced in GitLab Runner 8.9.

GitLab and GitLab Runner perform a [shallow clone](../pipelines/settings.md#limit-the-number-of-changes-fetched-during-clone)
by default.

Ideally, you should always use `GIT_DEPTH` with a small number
like 10. This instructs GitLab Runner to perform shallow clones.
Shallow clones make Git request only the latest set of changes for a given branch,
up to the desired number of commits as defined by the `GIT_DEPTH` variable.

This significantly speeds up fetching of changes from Git repositories,
especially if the repository has a very long backlog consisting of a number
of big files, because it effectively reduces the amount of data transferred.

The following example makes the runner shallow clone to fetch only a given branch;
it does not fetch any other branches or tags.

```yaml
variables:
  GIT_DEPTH: 10

test:
  script:
    - ls -al
```

## Git strategy

> Introduced in GitLab Runner 8.9.

By default, GitLab is configured to use the [`fetch` Git strategy](../runners/configure_runners.md#git-strategy),
which is recommended for large repositories.

This strategy reduces the amount of data to transfer and
has little impact on the operations that you might perform on a repository from CI.
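
As a minimal sketch, a project can also state the strategy explicitly in its `.gitlab-ci.yml` rather than rely on the default. The `GIT_STRATEGY` variable is covered in the linked runner documentation; the `test` job name is only illustrative.

```yaml
variables:
  # Reuse the existing worktree and fetch only new changes.
  GIT_STRATEGY: fetch

test:
  script:
    - ls -al
```
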
## Git clone path

> Introduced in GitLab Runner 11.10.

[`GIT_CLONE_PATH`](../runners/configure_runners.md#custom-build-directories) allows you to
control where you clone your sources. This can have implications if you
heavily use big repositories with a fork-based workflow.

From the perspective of GitLab Runner, a fork is stored as a separate repository
with a separate worktree. That means that GitLab Runner cannot optimize the usage
of worktrees on its own, and you might have to instruct it to do so explicitly.

In such cases, ideally you want to make the GitLab Runner executor be used only
for the given project and not shared across different projects to make this
process more efficient.

The [`GIT_CLONE_PATH`](../runners/configure_runners.md#custom-build-directories) has to be
within the `$CI_BUILDS_DIR`. Currently, it is impossible to pick any path
from disk.
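
A minimal sketch of a custom clone path follows. It assumes the runner has custom build directories enabled. The `project-name` directory and the `test` job are hypothetical names chosen for illustration.

```yaml
variables:
  # The path must resolve to a location inside $CI_BUILDS_DIR.
  GIT_CLONE_PATH: $CI_BUILDS_DIR/project-name

test:
  script:
    - pwd
```
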
## Git clean flags

> Introduced in GitLab Runner 11.10.

[`GIT_CLEAN_FLAGS`](../runners/configure_runners.md#git-clean-flags) allows you to control
whether or not you require the `git clean` command to be executed for each CI
job. By default, GitLab ensures that you have your worktree on the given SHA,
and that your repository is clean.

[`GIT_CLEAN_FLAGS`](../runners/configure_runners.md#git-clean-flags) is disabled when set
to `none`. On very big repositories, this might be desired because `git clean`
is disk I/O intensive. Controlling that with `GIT_CLEAN_FLAGS: -ffdx -e .build/`
(for example) allows you to control and disable removal of some
directories within the worktree between subsequent runs, which can speed up
the incremental builds. This has the biggest effect if you re-use existing
machines and have an existing worktree that you can re-use for builds.

For exact parameters accepted by
[`GIT_CLEAN_FLAGS`](../runners/configure_runners.md#git-clean-flags), see the documentation
for [`git clean`](https://git-scm.com/docs/git-clean). The available parameters
are dependent on the Git version.
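
The flags mentioned above can be set in `.gitlab-ci.yml` roughly as follows. This is a sketch only: the `.build/` directory comes from the example in the text, and the `build` job is an illustrative name.

```yaml
variables:
  # Remove untracked and ignored files, but keep the .build/ directory
  # so that incremental builds can reuse it.
  GIT_CLEAN_FLAGS: -ffdx -e .build/
  # To skip `git clean` entirely, use this instead:
  # GIT_CLEAN_FLAGS: none

build:
  script:
    - ls -al
```
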
## Git fetch extra flags

> [Introduced](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4142) in GitLab Runner 13.1.

[`GIT_FETCH_EXTRA_FLAGS`](../runners/configure_runners.md#git-fetch-extra-flags) allows you
to modify `git fetch` behavior by passing extra flags.

For example, if your project contains a large number of tags that your CI jobs don't rely on,
you could add [`--no-tags`](https://git-scm.com/docs/git-fetch#Documentation/git-fetch.txt---no-tags)
to the extra flags to make your fetches faster and more compact.

Even if your repository does _not_ contain a lot of
tags, `--no-tags` can [make a big difference in some cases](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746).
If your CI builds do not depend on Git tags, it is worth trying.

See the [`GIT_FETCH_EXTRA_FLAGS` documentation](../runners/configure_runners.md#git-fetch-extra-flags)
for more information.
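
A hedged sketch of passing `--no-tags` follows. It assumes that a custom value may replace the runner's default extra flags, which is why `--prune` is listed explicitly. Check the linked documentation for the defaults that apply to your runner version.

```yaml
variables:
  # Assumption: --prune is repeated here in case custom flags replace the defaults.
  GIT_FETCH_EXTRA_FLAGS: --prune --no-tags

test:
  script:
    - ls -al
```
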
## Fork-based workflow

> Introduced in GitLab Runner 11.10.

Following the guidelines above, let's imagine that we want to:

- Optimize for a big project (more than 50k files in directory).
- Use a fork-based workflow for contributing.
- Reuse existing worktrees. Have preconfigured runners that are pre-cloned with repositories.
- Have a runner assigned only to the project and all of its forks.

Let's consider the following two examples, one using the `shell` executor and
the other using the `docker` executor.

### `shell` executor example

Let's assume that you have the following [`config.toml`](https://docs.gitlab.com/runner/configuration/advanced-configuration.html).

```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "shell"
  builds_dir = "/builds"
  cache_dir = "/cache"

  [runners.custom_build_dir]
    enabled = true
```

This `config.toml`:

- Uses the `shell` executor.
- Specifies a custom `/builds` directory where all clones are stored.
- Enables the ability to specify `GIT_CLONE_PATH`.
- Runs at most 4 jobs at once.

### `docker` executor example

Let's assume that you have the following [`config.toml`](https://docs.gitlab.com/runner/configuration/advanced-configuration.html).

```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"

  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
```

This `config.toml`:

- Uses the `docker` executor.
- Specifies a custom `/builds` directory on disk where all clones are stored.
  We host mount the `/builds` directory to make it reusable between subsequent runs
  and be allowed to override the cloning strategy.
- Doesn't explicitly enable the ability to specify `GIT_CLONE_PATH`, because it is enabled by default.
- Runs at most 4 jobs at once.

### Our `.gitlab-ci.yml`

Once we have the executor configured, we need to fine-tune our `.gitlab-ci.yml`.

Our pipeline is most performant if we use the following `.gitlab-ci.yml`:

```yaml
variables:
  GIT_DEPTH: 10
  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME

build:
  script: ls -al
```

The above configures a:

- Shallow clone of 10, to speed up subsequent `git fetch` commands.
- Custom clone path to make it possible to re-use worktrees between the parent project and all forks
  because we use the same clone path for all forks.

Why use `$CI_CONCURRENT_ID`? The main reason is to ensure that the worktrees used do not conflict
between projects. The `$CI_CONCURRENT_ID` represents a unique identifier within the given executor.
When we use it to construct the path, this directory does not conflict
with other concurrent jobs running.
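
To see how these variables resolve on a particular runner, a throwaway job such as the following can print them. This is only a sketch: the job name `show-clone-path` is hypothetical, and the printed values depend on the runner's configuration.

```yaml
show-clone-path:
  variables:
    GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME
  script:
    # Jobs running at the same time on the same executor get different IDs.
    - echo "Concurrent ID is $CI_CONCURRENT_ID"
    # CI_PROJECT_DIR is the directory the sources were cloned into.
    - echo "Sources are in $CI_PROJECT_DIR"
```
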
### Store custom clone options in `config.toml`

Ideally, all job-related configuration should be stored in `.gitlab-ci.yml`.
However, sometimes it is desirable to make these schemes part of the runner's configuration.

In the above example of forks, making this configuration discoverable for users may be preferred,
but this brings administrative overhead as the `.gitlab-ci.yml` needs to be updated for each branch.
In such cases, it might be desirable to keep the `.gitlab-ci.yml` clone path agnostic, but make it
a configuration of the runner.

We can extend our [`config.toml`](https://docs.gitlab.com/runner/configuration/advanced-configuration.html)
with the following specification that is used by the runner if `.gitlab-ci.yml` does not override it:

```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"

  environment = [
    "GIT_DEPTH=10",
    "GIT_CLONE_PATH=$CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME"
  ]

  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
```

This makes the cloning configuration part of the given runner
and does not require us to update each `.gitlab-ci.yml`.

## Git fetch caching or pre-clone step

For very active repositories with a large number of references and files, you can do either or both of the following:

- Consider using the [Gitaly pack-objects cache](../../administration/gitaly/configure_gitaly.md#pack-objects-cache) instead of a
  pre-clone step. This is easier to set up and it benefits all repositories on your GitLab server, unlike the pre-clone step that
  must be configured per-repository. The pack-objects cache also automatically works for forks. On GitLab.com, where the pack-objects cache is
  enabled on all Gitaly servers, we found that we no longer need a pre-clone step for `gitlab-org/gitlab` development.
- Optimize your CI/CD jobs by seeding repository data in a pre-clone step with the
  [`pre_clone_script`](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-section) of GitLab Runner. See
  [SaaS runners on Linux](../runners/saas/linux_saas_runner.md#pre-clone-script) for details.

Besides speeding up pipelines in large and active projects,
seeding the repository data also helps avoid
`429 Too many requests` errors from Cloudflare.
This error can occur if you have many runners behind a single,
NAT'ed IP address that pull from GitLab.com.