Add usefull tips about big repositories
This commit is contained in:
parent
50cd5d9b77
commit
c93779accb
2 changed files with 236 additions and 0 deletions
|
@ -62,6 +62,7 @@ into more features:
|
|||
| [ChatOps](chatops/README.md) | Trigger CI jobs from chat, with results sent back to the channel. |
|
||||
| [Interactive web terminals](interactive_web_terminal/index.md) | Open an interactive web terminal to debug the running jobs. |
|
||||
| [Review Apps](review_apps/index.md) | Configure GitLab CI/CD to preview code changes in a per-branch basis. |
|
||||
| [Optimising GitLab for large repositories](large_repositories/index.md) | Useful tips on how to optimise GitLab and GitLab Runner for big repositories. |
|
||||
| [Deploy Boards](https://docs.gitlab.com/ee/user/project/deploy_boards.html) **[PREMIUM]** | Check the current health and status of each CI/CD environment running on Kubernetes. |
|
||||
| [GitLab CI/CD for external repositories](https://docs.gitlab.com/ee/ci/ci_cd_for_external_repos/index.html) **[PREMIUM]** | Get the benefits of GitLab CI/CD combined with repositories in GitHub and BitBucket Cloud. |
|
||||
|
||||
|
|
235
doc/ci/large_repositories/index.md
Normal file
235
doc/ci/large_repositories/index.md
Normal file
|
@ -0,0 +1,235 @@
|
|||
# Optimising GitLab for large repositories
|
||||
|
||||
Large repositories consisting of more than 50k files in a worktree
|
||||
often require special consideration because of
|
||||
the time required to clone and check out.
|
||||
|
||||
GitLab and GitLab Runner handle this scenario well
|
||||
but require optimised configuration to efficiently perform its
|
||||
set of operations.
|
||||
|
||||
The general guidelines for handling big repositories are simple.
|
||||
Each guideline is described in more detail in the sections below:
|
||||
|
||||
- Always fetch incrementally. Do not clone in a way that results in recreating all of the worktree.
|
||||
- Always use shallow clone to reduce data transfer. Be aware that this puts more burden
|
||||
on GitLab instance due to higher CPU impact.
|
||||
- Control the clone directory if you heavily use a fork-based workflow.
|
||||
- Optimise `git clean` flags to ensure that you remove or keep data that might affect or speed-up your build.
|
||||
|
||||
## Shallow cloning
|
||||
|
||||
> Introduced in GitLab Runner 8.9.
|
||||
|
||||
GitLab and GitLab Runner always perform a full clone by default.
|
||||
While it means that all changes from GitLab are received,
|
||||
it often results in receiving extra commit logs.
|
||||
|
||||
Ideally, you should always use `GIT_DEPTH` with a small number
|
||||
like 10. This will instruct GitLab Runner to perform shallow clones.
|
||||
Shallow clones makes Git request only the latest set of changes for a given branch,
|
||||
up to desired number of commits as defined by the `GIT_DEPTH` variable.
|
||||
|
||||
This significantly speeds up fetching of changes from Git repositories,
|
||||
especially if the repository has a very long backlog consisting of number
|
||||
of big files as we effectively reduce amount of data transfer.
|
||||
|
||||
The following example makes GitLab Runner shallow clone to fetch only a given branch,
|
||||
it does not fetch any other branches nor tags.
|
||||
|
||||
```yaml
|
||||
variables:
|
||||
GIT_DEPTH: 10
|
||||
|
||||
test:
|
||||
script:
|
||||
- ls -al
|
||||
```
|
||||
|
||||
## Git strategy
|
||||
|
||||
> Introduced in GitLab Runner 8.9.
|
||||
|
||||
By default, GitLab is configured to always prefer the `GIT_STRATEGY: fetch` strategy.
|
||||
The `GIT_STRATEGY: fetch` strategy will re-use existing worktrees if found
|
||||
on disk. This is different to the `GIT_STRATEGY: clone` strategy
|
||||
as in case of clones, if a worktree is found, it is removed before clone.
|
||||
|
||||
Usage of `fetch` is preferred because it reduces the amount of data to transfer and
|
||||
does not really impact the operations that you might do on a repository from CI.
|
||||
|
||||
However, `fetch` does require access to the previous worktree. This works
|
||||
well when using the `shell` or `docker` executor because these
|
||||
try to preserve worktrees and try to re-use them by default.
|
||||
|
||||
This does not work today for `kubernetes` executor and has limitations when using
|
||||
`docker+machine`. `kubernetes` executor today always clones into ephemeral directory.
|
||||
|
||||
GitLab also offers the `GIT_STRATEGY: none` strategy. This disables any `fetch` and `checkout` commands
|
||||
done by GitLab, requiring you to do them.
|
||||
|
||||
## Git clone path
|
||||
|
||||
> Introduced in GitLab Runner 11.10.
|
||||
|
||||
`GIT_CLONE_PATH` allows you to control where you clone your sources.
|
||||
This can have implications if you heavily use big repositories with fork workflow.
|
||||
|
||||
Fork workflow from GitLab Runner's perspective is stored as a separate repository
|
||||
with separate worktree. That means that GitLab Runner cannot optimise the usage
|
||||
of worktrees and you might have to instruct GitLab Runner to use that.
|
||||
|
||||
In such cases, ideally you want to make the GitLab Runner executor be used only used only
|
||||
for the given project and not shared across different projects to make this
|
||||
process more efficient.
|
||||
|
||||
The `GIT_CLONE_PATH` has to be within the `$CI_BUILDS_DIR`. Currently,
|
||||
it is impossible to pick any path from disk.
|
||||
|
||||
## Git clean flags
|
||||
|
||||
> Introduced in GitLab Runner 11.10.
|
||||
|
||||
`GIT_CLEAN_FLAGS` allows you to control whether or not you require
|
||||
the `git clean` command to be executed for each CI job.
|
||||
By default, GitLab ensures that you have your worktree on the given SHA,
|
||||
and that your repository is clean.
|
||||
|
||||
`GIT_CLEAN_FLAGS` is disabled when set to `none`. On very big repositories, this
|
||||
might be desired because `git clean` is disk I/O intensive. Controlling that
|
||||
with `GIT_CLEAN_FLAGS: -ffdx -e .build/`, for example, allows you to control and
|
||||
disable removal of some directories within the worktree between subsequent runs,
|
||||
which can speed-up the incremental builds. This has the biggest effect
|
||||
if you re-use existing machines, and have an existing worktree that you can re-use
|
||||
for builds.
|
||||
|
||||
For exact parameters accepted by `GIT_CLEAN_FLAGS`, see the documentation
|
||||
for [git clean](https://git-scm.com/docs/git-clean). The
|
||||
available parameters are dependent on Git version.
|
||||
|
||||
## Fork-based workflow
|
||||
|
||||
> Introduced in GitLab Runner 11.10.
|
||||
|
||||
Following the guidelines above, lets imagine that we want to:
|
||||
|
||||
- Optimise for a big project (more than 50k files in directory).
|
||||
- Use forks-based workflow for contributing.
|
||||
- Reuse existing worktrees. Have preconfigured runners that are pre-cloned with repositories.
|
||||
- Runner assigned only to project and all forks.
|
||||
|
||||
Lets consider the following two examples, one using `shell` executor and
|
||||
other using `docker` executor.
|
||||
|
||||
### `shell` executor example
|
||||
|
||||
Lets assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html).
|
||||
|
||||
```toml
|
||||
concurrent = 4
|
||||
|
||||
[[runners]]
|
||||
url = "GITLAB_URL"
|
||||
token = "TOKEN"
|
||||
executor = "shell"
|
||||
builds_dir = "/builds"
|
||||
cache_dir = "/cache"
|
||||
|
||||
[runners.custom_build_dir]
|
||||
enabled = true
|
||||
```
|
||||
|
||||
This `config.toml`:
|
||||
|
||||
- Uses the `shell` executor,
|
||||
- Specifies a custom `/builds` directory where all clones will be stored.
|
||||
- Enables the ability to specify `GIT_CLONE_PATH`,
|
||||
- Runs at most 4 jobs at once.
|
||||
|
||||
### `docker` executor example
|
||||
|
||||
Lets assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html).
|
||||
|
||||
```toml
|
||||
concurrent = 4
|
||||
|
||||
[[runners]]
|
||||
url = "GITLAB_URL"
|
||||
token = "TOKEN"
|
||||
executor = "docker"
|
||||
builds_dir = "/builds"
|
||||
cache_dir = "/cache"
|
||||
|
||||
[runners.docker]
|
||||
volumes = ["/builds:/builds", "/cache:/cache"]
|
||||
```
|
||||
|
||||
This `config.toml`:
|
||||
|
||||
- Uses the `docker` executor,
|
||||
- Specifies a custom `/builds` directory on disk where all clones will be stored.
|
||||
We host mount the `/builds` directory to make it reusable between subsequent runs
|
||||
and be allowed to override the cloning strategy.
|
||||
- Doesn't enable the ability to specify `GIT_CLONE_PATH` as it is enabled by default.
|
||||
- Runs at most 4 jobs at once.
|
||||
|
||||
### Our `.gitlab-ci.yml`
|
||||
|
||||
Once we have the executor configured, we need to fine tune our `.gitlab-ci.yml`.
|
||||
|
||||
Our pipeline will be most performant if we use the following `.gitlab-ci.yml`:
|
||||
|
||||
```yaml
|
||||
variables:
|
||||
GIT_DEPTH: 10
|
||||
GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME
|
||||
|
||||
build:
|
||||
script: ls -al
|
||||
```
|
||||
|
||||
The above configures a:
|
||||
|
||||
- Shallow clone of 10, to speed up subsequent `git fetch` commands.
|
||||
- Custom clone path to make it possible to re-use worktrees between parent project and all forks
|
||||
because we use the same clone path for all forks.
|
||||
|
||||
Why use `$CI_CONCURRENT_ID`? The main reason is to ensure that worktrees used are not conflicting
|
||||
between projects. The `$CI_CONCURRENT_ID` represents a unique identifier within the given executor,
|
||||
so as long as we use it to construct the path, it is guaranteed that this directory will not conflict
|
||||
with other concurrent jobs running.
|
||||
|
||||
### Store custom clone options in `config.toml`
|
||||
|
||||
Ideally, all job-related configuration should be stored in `.gitlab-ci.yml`.
|
||||
However, sometimes it is desirable to make these schemes part of Runner configuration.
|
||||
|
||||
In the above example of Forks, making this configuration discoverable for users may be preferred,
|
||||
but this brings administrative overhead as the `.gitlab-ci.yml` needs to be updated for each branch.
|
||||
In such cases, it might be desirable to keep the `.gitlab-ci.yml` clone path agnostic, but make it
|
||||
a configuration of Runner.
|
||||
|
||||
We can extend our [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html)
|
||||
with the following specification that will be used by Runner if `.gitlab-ci.yml` will not override it:
|
||||
|
||||
```toml
|
||||
concurrent = 4
|
||||
|
||||
[[runners]]
|
||||
url = "GITLAB_URL"
|
||||
token = "TOKEN"
|
||||
executor = "docker"
|
||||
builds_dir = "/builds"
|
||||
cache_dir = "/cache"
|
||||
|
||||
environment = [
|
||||
"GIT_DEPTH=10",
|
||||
"GIT_CLONE_PATH=$CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME"
|
||||
]
|
||||
|
||||
[runners.docker]
|
||||
volumes = ["/builds:/builds", "/cache:/cache"]
|
||||
```
|
||||
|
||||
This makes the cloning configuration to be part of given Runner,
|
||||
and does not require us to update each `.gitlab-ci.yml`.
|
Loading…
Reference in a new issue