Add useful tips about big repositories
parent
50cd5d9b77
commit
c93779accb
2 changed files with 236 additions and 0 deletions
|
@ -62,6 +62,7 @@ into more features:
|
||||||
| [ChatOps](chatops/README.md) | Trigger CI jobs from chat, with results sent back to the channel. |
|
||||||
| [Interactive web terminals](interactive_web_terminal/index.md) | Open an interactive web terminal to debug the running jobs. |
|
||||||
| [Review Apps](review_apps/index.md) | Configure GitLab CI/CD to preview code changes on a per-branch basis. |
|
||||||
|
| [Optimising GitLab for large repositories](large_repositories/index.md) | Useful tips on how to optimise GitLab and GitLab Runner for big repositories. |
|
||||||
| [Deploy Boards](https://docs.gitlab.com/ee/user/project/deploy_boards.html) **[PREMIUM]** | Check the current health and status of each CI/CD environment running on Kubernetes. |
|
||||||
| [GitLab CI/CD for external repositories](https://docs.gitlab.com/ee/ci/ci_cd_for_external_repos/index.html) **[PREMIUM]** | Get the benefits of GitLab CI/CD combined with repositories in GitHub and BitBucket Cloud. |
|
||||||
|
|
||||||
|
|
235
doc/ci/large_repositories/index.md
Normal file
|
@ -0,0 +1,235 @@
|
||||||
|
# Optimising GitLab for large repositories
|
||||||
|
|
||||||
|
Large repositories consisting of more than 50k files in a worktree
|
||||||
|
often require special consideration because of
|
||||||
|
the time required to clone and check out.
|
||||||
|
|
||||||
|
GitLab and GitLab Runner handle this scenario well
|
||||||
|
but require optimised configuration to efficiently perform their
|
||||||
|
set of operations.
|
||||||
|
|
||||||
|
The general guidelines for handling big repositories are simple.
|
||||||
|
Each guideline is described in more detail in the sections below:
|
||||||
|
|
||||||
|
- Always fetch incrementally. Do not clone in a way that results in recreating all of the worktree.
|
||||||
|
- Always use shallow clone to reduce data transfer. Be aware that this puts more burden
|
||||||
|
  on the GitLab instance due to higher CPU impact.
|
||||||
|
- Control the clone directory if you heavily use a fork-based workflow.
|
||||||
|
- Optimise `git clean` flags to ensure that you remove or keep data that might affect or speed up your build.
|
||||||
|
|
||||||
|
## Shallow cloning
|
||||||
|
|
||||||
|
> Introduced in GitLab Runner 8.9.
|
||||||
|
|
||||||
|
GitLab and GitLab Runner always perform a full clone by default.
|
||||||
|
While it means that all changes from GitLab are received,
|
||||||
|
it often results in receiving extra commit logs.
|
||||||
|
|
||||||
|
Ideally, you should always use `GIT_DEPTH` with a small number
|
||||||
|
like 10. This will instruct GitLab Runner to perform shallow clones.
|
||||||
|
Shallow cloning makes Git request only the latest set of changes for a given branch,
|
||||||
|
up to the desired number of commits as defined by the `GIT_DEPTH` variable.
|
||||||
|
|
||||||
|
This significantly speeds up fetching of changes from Git repositories,
|
||||||
|
especially if the repository has a very long history containing a number
of big files, because it effectively reduces the amount of data transferred.
|
||||||
|
|
||||||
|
The following example makes GitLab Runner perform a shallow clone that fetches only a given branch.
It does not fetch any other branches or tags.
|
||||||
|
|
||||||
|
```yaml
variables:
  GIT_DEPTH: 10

test:
  script:
    - ls -al
```
|
||||||
|
|
||||||
|
## Git strategy
|
||||||
|
|
||||||
|
> Introduced in GitLab Runner 8.9.
|
||||||
|
|
||||||
|
By default, GitLab is configured to always prefer the `GIT_STRATEGY: fetch` strategy.
|
||||||
|
The `GIT_STRATEGY: fetch` strategy will re-use existing worktrees if found
|
||||||
|
on disk. This is different from the `GIT_STRATEGY: clone` strategy:
with `clone`, any worktree that is found is removed before cloning.
|
||||||
|
|
||||||
|
Usage of `fetch` is preferred because it reduces the amount of data to transfer and
|
||||||
|
does not really impact the operations that you might do on a repository from CI.
|
||||||
|
|
||||||
|
However, `fetch` does require access to the previous worktree. This works
|
||||||
|
well when using the `shell` or `docker` executor because these
|
||||||
|
try to preserve worktrees and try to re-use them by default.
|
||||||
|
|
||||||
|
This does not currently work for the `kubernetes` executor and has limitations when using
`docker+machine`. The `kubernetes` executor currently always clones into an ephemeral directory.
|
||||||
|
|
||||||
|
GitLab also offers the `GIT_STRATEGY: none` strategy. This disables any `fetch` and `checkout` commands
run by GitLab, requiring you to run them yourself.
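
As a minimal sketch of how these strategies might be combined, the following `.gitlab-ci.yml` sets `fetch` for the whole pipeline and overrides it with `none` for a single job that does not need the sources. The job names `build` and `notify` are hypothetical and only illustrate the idea:

```yaml
variables:
  # Re-use the existing worktree when the executor preserved one.
  GIT_STRATEGY: fetch

build:
  script:
    - ls -al

notify:
  variables:
    # Skip all Git operations for this job; the worktree may be missing or stale.
    GIT_STRATEGY: none
  script:
    - echo "This job does not need the repository sources"
```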
|
||||||
|
|
||||||
|
## Git clone path
|
||||||
|
|
||||||
|
> Introduced in GitLab Runner 11.10.
|
||||||
|
|
||||||
|
`GIT_CLONE_PATH` allows you to control where you clone your sources.
|
||||||
|
This can have implications if you heavily use big repositories with a fork-based workflow.
|
||||||
|
|
||||||
|
From GitLab Runner's perspective, each fork is stored as a separate repository
with a separate worktree. That means that GitLab Runner cannot optimise the usage
of worktrees on its own, and you might have to instruct it to do so.
|
||||||
|
|
||||||
|
In such cases, ideally you want the GitLab Runner executor to be used only
for the given project, and not shared across different projects, to make this
process more efficient.
|
||||||
|
|
||||||
|
The `GIT_CLONE_PATH` has to be within the `$CI_BUILDS_DIR`. Currently,
|
||||||
|
it is not possible to pick an arbitrary path on disk.
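
For illustration, a job-level configuration might look like the sketch below. It assumes that the Runner allows custom build directories (the `custom_build_dir` setting shown in the examples later on this page). The job name `test` is hypothetical:

```yaml
test:
  variables:
    # Must resolve to a path inside $CI_BUILDS_DIR.
    GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_PROJECT_NAME
  script:
    - pwd
```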
|
||||||
|
|
||||||
|
## Git clean flags
|
||||||
|
|
||||||
|
> Introduced in GitLab Runner 11.10.
|
||||||
|
|
||||||
|
`GIT_CLEAN_FLAGS` allows you to control whether or not you require
|
||||||
|
the `git clean` command to be executed for each CI job.
|
||||||
|
By default, GitLab ensures that you have your worktree on the given SHA,
|
||||||
|
and that your repository is clean.
|
||||||
|
|
||||||
|
`git clean` is skipped when `GIT_CLEAN_FLAGS` is set to `none`. On very big repositories, this
|
||||||
|
might be desired because `git clean` is disk I/O intensive. Setting
`GIT_CLEAN_FLAGS: -ffdx -e .build/`, for example, prevents the removal
of selected directories within the worktree between subsequent runs,
which can speed up incremental builds. This has the biggest effect
|
||||||
|
if you re-use existing machines, and have an existing worktree that you can re-use
|
||||||
|
for builds.
|
||||||
|
|
||||||
|
For exact parameters accepted by `GIT_CLEAN_FLAGS`, see the documentation
|
||||||
|
for [git clean](https://git-scm.com/docs/git-clean). The
|
||||||
|
available parameters depend on the Git version.
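
The two cases described above could be sketched as follows. The job names are hypothetical, and `.build/` stands in for whatever directory holds your incremental build output:

```yaml
incremental_build:
  variables:
    # Clean untracked and ignored files, but keep the .build/ directory.
    GIT_CLEAN_FLAGS: -ffdx -e .build/
  script:
    - ls -al

no_clean:
  variables:
    # Skip `git clean` entirely for this job.
    GIT_CLEAN_FLAGS: none
  script:
    - ls -al
```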
|
||||||
|
|
||||||
|
## Fork-based workflow
|
||||||
|
|
||||||
|
> Introduced in GitLab Runner 11.10.
|
||||||
|
|
||||||
|
Following the guidelines above, let's imagine that we want to:
|
||||||
|
|
||||||
|
- Optimise for a big project (more than 50k files in the directory).
|
||||||
|
- Use a fork-based workflow for contributing.
|
||||||
|
- Reuse existing worktrees. Have preconfigured runners that are pre-cloned with repositories.
|
||||||
|
- Assign a runner only to the project and all of its forks.
|
||||||
|
|
||||||
|
Let's consider the following two examples, one using the `shell` executor and
the other using the `docker` executor.
|
||||||
|
|
||||||
|
### `shell` executor example
|
||||||
|
|
||||||
|
Let's assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html):
|
||||||
|
|
||||||
|
```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "shell"
  builds_dir = "/builds"
  cache_dir = "/cache"

  [runners.custom_build_dir]
    enabled = true
```
|
||||||
|
|
||||||
|
This `config.toml`:
|
||||||
|
|
||||||
|
- Uses the `shell` executor.
|
||||||
|
- Specifies a custom `/builds` directory where all clones will be stored.
|
||||||
|
- Enables the ability to specify `GIT_CLONE_PATH`.
|
||||||
|
- Runs at most 4 jobs at once.
|
||||||
|
|
||||||
|
### `docker` executor example
|
||||||
|
|
||||||
|
Let's assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html):
|
||||||
|
|
||||||
|
```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"

  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
```
|
||||||
|
|
||||||
|
This `config.toml`:
|
||||||
|
|
||||||
|
- Uses the `docker` executor.
|
||||||
|
- Specifies a custom `/builds` directory on disk where all clones will be stored.
|
||||||
|
  The `/builds` directory is mounted from the host to make it reusable between subsequent runs
  and to allow overriding the cloning strategy.
|
||||||
|
- Does not explicitly enable the ability to specify `GIT_CLONE_PATH`, because it is enabled by default for the `docker` executor.
|
||||||
|
- Runs at most 4 jobs at once.
|
||||||
|
|
||||||
|
### Our `.gitlab-ci.yml`
|
||||||
|
|
||||||
|
Once we have the executor configured, we need to fine-tune our `.gitlab-ci.yml`.
|
||||||
|
|
||||||
|
Our pipeline will be most performant if we use the following `.gitlab-ci.yml`:
|
||||||
|
|
||||||
|
```yaml
variables:
  GIT_DEPTH: 10
  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME

build:
  script: ls -al
```
|
||||||
|
|
||||||
|
The above configures a:
|
||||||
|
|
||||||
|
- Shallow clone with a depth of 10, to speed up subsequent `git fetch` commands.
|
||||||
|
- Custom clone path to make it possible to re-use worktrees between the parent project and all forks
|
||||||
|
because we use the same clone path for all forks.
|
||||||
|
|
||||||
|
Why use `$CI_CONCURRENT_ID`? The main reason is to ensure that the worktrees used do not conflict
|
||||||
|
between projects. The `$CI_CONCURRENT_ID` represents a unique identifier within the given executor,
|
||||||
|
so as long as we use it to construct the path, it is guaranteed that this directory will not conflict
|
||||||
|
with other concurrent jobs running.
|
||||||
|
|
||||||
|
### Store custom clone options in `config.toml`
|
||||||
|
|
||||||
|
Ideally, all job-related configuration should be stored in `.gitlab-ci.yml`.
|
||||||
|
However, sometimes it is desirable to make these schemes part of the Runner configuration.
|
||||||
|
|
||||||
|
In the fork-based example above, keeping this configuration in `.gitlab-ci.yml` makes it discoverable for users,
but it brings administrative overhead because the `.gitlab-ci.yml` needs to be updated for each branch.
In such cases, it might be desirable to keep the `.gitlab-ci.yml` clone-path agnostic and make the clone path
part of the Runner configuration instead.
|
||||||
|
|
||||||
|
We can extend our [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html)
|
||||||
|
with the following specification, which Runner uses unless `.gitlab-ci.yml` overrides it:
|
||||||
|
|
||||||
|
```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"

  environment = [
    "GIT_DEPTH=10",
    "GIT_CLONE_PATH=$CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME"
  ]

  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
```
|
||||||
|
|
||||||
|
This makes the cloning configuration part of the given Runner,
|
||||||
|
and does not require us to update each `.gitlab-ci.yml`.
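
If a single project needs different values, it can still set them itself. A minimal sketch, assuming that variables defined in `.gitlab-ci.yml` take precedence over the Runner's `environment` entries as described above (the depth of 50 is an arbitrary illustrative value):

```yaml
variables:
  # Overrides the GIT_DEPTH=10 default coming from config.toml.
  GIT_DEPTH: 50

build:
  script: ls -al
```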
|