Docs: Merge EE doc/development to CE
parent 6f54ced40d
commit 0207468401

@ -38,6 +38,7 @@ description: 'Learn how to contribute to GitLab.'

- [Sidekiq guidelines](sidekiq_style_guide.md) for working with Sidekiq workers
- [Working with Gitaly](gitaly.md)
- [Manage feature flags](feature_flags.md)
- [Licensed feature availability](licensed_feature_availability.md)
- [View sent emails or preview mailers](emails.md)
- [Shell commands](shell_commands.md) in the GitLab codebase
- [`Gemfile` guidelines](gemfile.md)

@ -48,6 +49,7 @@ description: 'Learn how to contribute to GitLab.'

- [How to dump production data to staging](db_dump.md)
- [Working with the GitHub importer](github_importer.md)
- [Import/Export development documentation](import_export.md)
- [Elasticsearch integration docs](elasticsearch.md)
- [Working with Merge Request diffs](diffs.md)
- [Kubernetes integration guidelines](kubernetes.md)
- [Permissions](permissions.md)

@ -55,6 +57,7 @@ description: 'Learn how to contribute to GitLab.'

- [Guidelines for reusing abstractions](reusing_abstractions.md)
- [DeclarativePolicy framework](policies.md)
- [How Git object deduplication works in GitLab](git_object_deduplication.md)
- [Geo development](geo.md)

## Performance guides

@ -155,7 +155,7 @@ the contribution acceptance criteria below:

   restarting the failing CI job, rebasing from master to bring in updates that
   may resolve the failure, or if it has not been fixed yet, ask a developer to
   help you fix the test.
1. The MR initially contains a few logically organized commits.
1. The changes can merge without problems. If not, you should rebase if you're the
   only one working on your feature branch, otherwise merge `master`.
1. Only one specific issue is fixed or one specific feature is implemented. Do not

@ -0,0 +1,166 @@

# Elasticsearch knowledge **[STARTER ONLY]**

This area is to maintain a compendium of useful information when working with Elasticsearch.

Information on how to enable Elasticsearch and perform the initial indexing is kept in the [Elasticsearch integration documentation](https://docs.gitlab.com/ee/integration/elasticsearch.html#enabling-elasticsearch).

## Initial installation on OS X

It is recommended to use the Docker image. After installing Docker, you can immediately spin up an instance with:

```sh
docker run --name elastic56 -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:5.6.12
```

and use `docker stop elastic56` and `docker start elastic56` to stop/start it.

### Installing on the host

We currently only support Elasticsearch [5.6 to 6.x](https://docs.gitlab.com/ee/integration/elasticsearch.html#requirements).

Version 5.6 is available on Homebrew and is the recommended version to use in order to test compatibility.

```sh
brew install elasticsearch@5.6
```

There is no need to install any plugins.

## New repo indexer (beta)

If you're interested in working with the new beta repo indexer, all you need to do is:

```sh
git clone git@gitlab.com:gitlab-org/gitlab-elasticsearch-indexer.git
make
make install
```

This adds `gitlab-elasticsearch-indexer` to `$GOPATH/bin`; please make sure that is in your `$PATH`. After that, GitLab will find it and you'll be able to enable it in the admin settings area.

NOTE: **Note:**
`make` will not recompile the executable unless you do `make clean` beforehand.

## Helpful rake tasks

- `gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index.
- `gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size.

Additionally, if you need large repos or multiple forks for testing, please consider [following these instructions](https://docs.gitlab.com/ee/development/rake_tasks.html#extra-project-seed-options).

## How does it work?

The Elasticsearch integration depends on an external indexer. We ship a [Ruby indexer](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/bin/elastic_repo_indexer) by default, but are also working on an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a rake task, but after this is done, GitLab itself will trigger reindexing when required via `after_` callbacks on create, update, and destroy, inherited from [`ee/app/models/concerns/elastic/application_search.rb`](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/app/models/concerns/elastic/application_search.rb).

All indexing after the initial one is done via `ElasticIndexerWorker` (Sidekiq jobs).

Search queries are generated by the concerns found in [`ee/app/models/concerns/elastic`](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have been a historic source of security bugs, so please pay close attention to them!

## Existing Analyzers/Tokenizers/Filters

These are all defined in [`ee/lib/elasticsearch/git/model.rb`](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/elasticsearch/git/model.rb).

### Analyzers

#### `path_analyzer`

Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.

Please see the `path_tokenizer` explanation below for an example.

#### `sha_analyzer`

Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.

Please see the `sha_tokenizer` explanation below for an example.

#### `code_analyzer`

Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: `code`, `edgeNGram_filter`, `lowercase`, and `asciifolding`.

The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.

Please see the `code` filter for an explanation of how tokens are split.

#### `code_search_analyzer`

Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.

### Tokenizers

#### `sha_tokenizer`

This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searchable by any subset of them (minimum of 5 characters).

Example:

`240c29dc7e` becomes:

- `240c2`
- `240c29`
- `240c29d`
- `240c29dc`
- `240c29dc7`
- `240c29dc7e`
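
As a plain-Ruby illustration (not the actual Elasticsearch code), the expansion above is simply every prefix of the SHA from the minimum gram length up:

```ruby
# Illustrative sketch only: mimic the edgeNGram expansion of a SHA with a
# minimum gram length of 5, as described above.
def sha_tokens(sha, min_gram: 5)
  (min_gram..sha.length).map { |len| sha[0, len] }
end

sha_tokens('240c29dc7e')
# => ["240c2", "240c29", "240c29d", "240c29dc", "240c29dc7", "240c29dc7e"]
```

This is why a search for any prefix of at least 5 characters, such as `240c29d`, can find the full SHA.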

#### `path_tokenizer`

This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.

Example:

`'/some/path/application.js'` becomes:

- `'/some/path/application.js'`
- `'some/path/application.js'`
- `'path/application.js'`
- `'application.js'`
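
A plain-Ruby way to picture this (illustrative only; the real work happens inside Elasticsearch) is to emit every suffix of the path:

```ruby
# Illustrative sketch only: approximate the reversed path_hierarchy tokenizer
# by emitting every suffix of the path down to the filename.
def path_tokens(path)
  parts = path.split('/')
  parts.each_index.map { |i| parts[i..-1].join('/') }
end

path_tokens('/some/path/application.js')
# => ["/some/path/application.js", "some/path/application.js", "path/application.js", "application.js"]
```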

### Filters

#### `code`

Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.

Patterns:

- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowerCamelCased strings as separate tokens
- `"(\\d+)"`: extracts digits
- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCased strings recursively. For example: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single quotes
- `'\.([^.]+)(?=\.|\s|\Z)'`: separates terms with periods in-between
- `'\/?([^\/]+)(?=\/|\b)'`: separates path terms `like/this/one`
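
The recursive CamelCase pattern can be tried directly in Ruby, whose regex engine handles the zero-width lookahead capture the same way for this case (illustrative only):

```ruby
# Illustrative sketch only: the recursive CamelCase pattern above, applied as
# a scan. The zero-width lookahead lets each scan position capture the longest
# CamelCased substring starting there.
CAMEL_PATTERN = /(?=([\p{Lu}]+[\p{L}]+))/

'ThisIsATest'.scan(CAMEL_PATTERN).flatten
# => ["ThisIsATest", "IsATest", "ATest", "Test"]
```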

#### `edgeNGram_filter`

Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenfilter.html) to allow inputs with only parts of a token to find the token. For example, it would turn `glasses` into permutations starting with `gl` and ending with `glasses`, which would allow a search for "`glass`" to find the original token `glasses`.
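
As a quick illustration (assuming a minimum gram length of 2, which is implied by the "starting with `gl`" example; the actual minimum is set in the filter definition):

```ruby
# Illustrative sketch only: with edge n-grams indexed for "glasses", a search
# for "glass" matches because the query equals one of the stored grams.
grams = (2..'glasses'.length).map { |len| 'glasses'[0, len] }
grams
# => ["gl", "gla", "glas", "glass", "glasse", "glasses"]
grams.include?('glass')
# => true
```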

## Gotchas

- Searches can have their own analyzers. Remember to check when editing analyzers.
- `Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches.

## Troubleshooting

### Getting "flood stage disk watermark [95%] exceeded"

You might get an error such as:

```
[2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct]
flood stage disk watermark [95%] exceeded on
[pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%],
all indices on this node will be marked read-only
```

This is because you've exceeded the disk space threshold - it thinks you don't have enough disk space left, based on the default 95% threshold.

In addition, the `read_only_allow_delete` setting will be set to `true`, which blocks indexing, `forcemerge`, etc. You can inspect the current index settings with:

```sh
curl "http://localhost:9200/gitlab-development/_settings?pretty"
```

Add this to your `elasticsearch.yml` file:

```yaml
# turn off the disk allocator
cluster.routing.allocation.disk.threshold_enabled: false
```

_or_

```yaml
# set your own limits
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 5gb   # ES 6.x only
cluster.routing.allocation.disk.watermark.low: 15gb
cluster.routing.allocation.disk.watermark.high: 10gb
```

Restart Elasticsearch, and the `read_only_allow_delete` setting will clear on its own.

_from "Disk-based Shard Allocation | Elasticsearch Reference" [5.6](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html#disk-allocator) and [6.x](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/disk-allocator.html)_

@ -16,10 +16,12 @@ New utility classes should be added to [`utilities.scss`](https://gitlab.com/git

**Background color**: `.bg-variant-shade` e.g. `.bg-warning-400`
**Text color**: `.text-variant-shade` e.g. `.text-success-500`

- variant is one of 'primary', 'secondary', 'success', 'warning', 'error'
- shade is one of the shades listed on [colors](https://design.gitlab.com/foundations/colors/)

**Font size**: `.text-size` e.g. `.text-2`

- **size** is a number from 1-6 from our [Type scale](https://design.gitlab.com/foundations/typography)

### Naming

@ -0,0 +1,417 @@

# Geo (development) **[PREMIUM ONLY]**

Geo connects GitLab instances together. One GitLab instance is
designated as a **primary** node and can be run with multiple
**secondary** nodes. Geo orchestrates quite a few components that are
described in more detail below.

## Database replication

Geo uses [streaming replication](#streaming-replication) to replicate
the database from the **primary** to the **secondary** nodes. This
replication gives the **secondary** nodes access to all the data saved
in the database, so users can log in on the **secondary** node and read
all the issues, merge requests, etc.

## Repository replication

Geo also replicates repositories. Each **secondary** node keeps track of
the state of every repository in the [tracking database](#tracking-database).

There are a few ways a repository gets replicated, by the:

- [Repository Sync worker](#repository-sync-worker).
- [Geo Log Cursor](#geo-log-cursor).

### Project Registry

The `Geo::ProjectRegistry` class defines the model used to track the
state of repository replication. For each project in the main
database, one record in the tracking database is kept.

It records the following about repositories:

- The last time they were synced.
- The last time they were synced successfully.
- If they need to be resynced.
- When a retry should be attempted.
- The number of retries.
- If and when they were verified.

It also stores these attributes for project wikis in dedicated columns.

### Repository Sync worker

The `Geo::RepositorySyncWorker` class runs periodically in the
background and searches the `Geo::ProjectRegistry` model for
projects that need updating. Those projects can be:

- Unsynced: Projects that have never been synced on the **secondary**
  node and so do not exist yet.
- Updated recently: Projects that have a `last_repository_updated_at`
  timestamp that is more recent than the `last_repository_successful_sync_at`
  timestamp in the `Geo::ProjectRegistry` model.
- Manual: The admin can manually flag a repository to resync in the
  [Geo admin panel](https://docs.gitlab.com/ee/user/admin_area/geo_nodes.html).

When we fail to fetch a repository on the secondary `RETRIES_BEFORE_REDOWNLOAD`
times, Geo does a so-called _redownload_. It will do a clean clone
into the `@geo-temporary` directory in the root of the storage. When
it's successful, we replace the main repo with the newly cloned one.

### Geo Log Cursor

The Geo Log Cursor is a separate process running on
each **secondary** node. It monitors the [Geo Event Log](#geo-event-log)
and handles all of the events. When it sees an unhandled event, it
starts a background worker to handle that event, depending on the type
of event.

When a repository receives an update, the Geo **primary** node creates
a Geo event with an associated repository updated event. The cursor
picks that up and schedules a `Geo::ProjectSyncWorker` job, which will
use the `Geo::RepositorySyncService` and `Geo::WikiSyncService`
classes to update the repository and the wiki.
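
Reduced to plain Ruby, the dispatch step looks roughly like this; the event and worker names are simplified stand-ins for the real classes:

```ruby
# Illustrative sketch only: how a log cursor maps an unhandled event to a
# background job. Real Geo uses dedicated event classes and Sidekiq workers.
RepositoryUpdatedEvent = Struct.new(:project_id)

def schedule_for(event)
  case event
  when RepositoryUpdatedEvent
    # would be Geo::ProjectSyncWorker.perform_async(event.project_id, ...)
    [:ProjectSyncWorker, event.project_id]
  end
end

schedule_for(RepositoryUpdatedEvent.new(42))
# => [:ProjectSyncWorker, 42]
```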

## Uploads replication

File uploads are also replicated to the **secondary** node. To
track the state of syncing, the `Geo::FileRegistry` model is used.

### File Registry

Similar to the [Project Registry](#project-registry), there is a
`Geo::FileRegistry` model that tracks the synced uploads.

CI job artifacts are synced in a similar way as uploads or LFS
objects, but they are tracked by the `Geo::JobArtifactRegistry` model.

### File Download Dispatch worker

Also similar to the [Repository Sync worker](#repository-sync-worker),
there is a `Geo::FileDownloadDispatchWorker` class that runs
periodically to sync all uploads that aren't synced to the Geo
**secondary** node yet.

Files are copied via HTTP(s) and initiated via the
`/api/v4/geo/transfers/:type/:id` endpoint,
e.g. `/api/v4/geo/transfers/lfs/123`.

## Authentication

To authenticate file transfers, each `GeoNode` record has two fields:

- A public access key (`access_key` field).
- A secret access key (`secret_access_key` field).

The **secondary** node authenticates itself via a [JWT request](https://jwt.io/).
When the **secondary** node wishes to download a file, it sends an
HTTP request with the `Authorization` header:

```
Authorization: GL-Geo <access_key>:<JWT payload>
```

The **primary** node uses the `access_key` field to look up the
corresponding Geo **secondary** node and decrypts the JWT payload,
which contains additional information to identify the file
request. This ensures that the **secondary** node downloads the right
file for the right database ID. For example, for an LFS object, the
request must also include the SHA256 sum of the file. An example JWT
payload looks like:

```
{ "data": { sha256: "31806bb23580caab78040f8c45d329f5016b0115" }, iat: "1234567890" }
```
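
A minimal sketch of constructing such a header using only the Ruby standard library; GitLab itself uses the `jwt` gem, and the keys and values here are made up:

```ruby
require 'openssl'
require 'base64'
require 'json'

# Illustrative sketch only: sign a JWT-shaped payload with the node's secret
# key and build the GL-Geo Authorization header value described above.
def geo_authorization_value(access_key, secret_key, data)
  encode    = ->(hash) { Base64.urlsafe_encode64(JSON.generate(hash), padding: false) }
  header    = encode.call(typ: 'JWT', alg: 'HS256')
  payload   = encode.call(data: data, iat: Time.now.to_i)
  signature = Base64.urlsafe_encode64(
    OpenSSL::HMAC.digest('SHA256', secret_key, "#{header}.#{payload}"),
    padding: false
  )
  "GL-Geo #{access_key}:#{header}.#{payload}.#{signature}"
end

geo_authorization_value('ABC123', 'not-a-real-secret',
                        sha256: '31806bb23580caab78040f8c45d329f5016b0115')
```

The receiving side would look up the secret by `access_key` and verify the signature before trusting the payload.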

If the requested file matches the requested SHA256 sum, then the Geo
**primary** node sends data via the [X-Sendfile](https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/)
feature, which allows NGINX to handle the file transfer without tying
up Rails or Workhorse.

NOTE: **Note:**
JWT requires synchronized clocks between the machines
involved, otherwise it may fail with an encryption error.

## Using the Tracking Database

Along with the main database that is replicated, a Geo **secondary**
node has its own separate [tracking database](#tracking-database).

The tracking database contains the state of the **secondary** node.

Any database migration that needs to be run as part of an upgrade
needs to be applied to the tracking database on each **secondary** node.

### Configuration

The database configuration is set in [`config/database_geo.yml`](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/config/database_geo.yml.postgresql).
The directory [`ee/db/geo`](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/ee/db/geo)
contains the schema and migrations for this database.

To write a migration for the database, use the `GeoMigrationGenerator`:

```sh
rails g geo_migration [args] [options]
```

To migrate the tracking database, run:

```sh
bundle exec rake geo:db:migrate
```

### Foreign Data Wrapper

The use of [FDW](#fdw) was introduced in GitLab 10.1.

This is useful for the [Geo Log Cursor](#geo-log-cursor) and improves
the performance of some synchronization operations.

While FDW is available in older versions of PostgreSQL, we needed to
raise the minimum required version to 9.6 as this includes many
performance improvements to the FDW implementation.

#### Refreshing the Foreign Tables

Whenever the database schema changes on the **primary** node, the
**secondary** node will need to refresh its foreign tables by running
the following:

```sh
bundle exec rake geo:db:refresh_foreign_tables
```

Failure to do this will prevent the **secondary** node from
functioning properly. The **secondary** node will generate error
messages, such as the following PostgreSQL error:

```
ERROR:  relation "gitlab_secondary.ci_job_artifacts" does not exist at character 323
STATEMENT:  SELECT a.attname, format_type(a.atttypid, a.atttypmod),
     pg_get_expr(d.adbin, d.adrelid), a.attnotnull, a.atttypid, a.atttypmod
       FROM pg_attribute a LEFT JOIN pg_attrdef d
         ON a.attrelid = d.adrelid AND a.attnum = d.adnum
      WHERE a.attrelid = '"gitlab_secondary"."ci_job_artifacts"'::regclass
        AND a.attnum > 0 AND NOT a.attisdropped
      ORDER BY a.attnum
```

## Finders

Geo uses [Finders](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/app/finders),
which are classes that take care of the heavy lifting of looking up
projects/attachments/etc. in the tracking database and main database.

### Finders Performance

The Finders need to compare data from the main database with data in
the tracking database. For example, counting the number of synced
projects normally involves retrieving the project IDs from one
database and checking their state in the other database. This is slow
and requires a lot of memory.

To overcome this, the Finders use [FDW](#fdw), or Foreign Data
Wrappers. This allows a regular `JOIN` between the main database and
the tracking database.

## Redis

Redis on the **secondary** node works the same as on the **primary**
node. It is used for caching, storing sessions, and other persistent
data.

Redis data replication between **primary** and **secondary** nodes is
not used, so sessions etc. aren't shared between nodes.

## Object Storage

GitLab can optionally use Object Storage to store data it would
otherwise store on disk. These things can be:

- LFS Objects
- CI Job Artifacts
- Uploads

Objects that are stored in object storage are not handled by Geo. Geo
ignores items in object storage. Either:

- The object storage layer should take care of its own geographical
  replication.
- All secondary nodes should use the same storage node.

## Verification

### Repository verification

Repositories are verified with a checksum.

The **primary** node calculates a checksum on the repository. It
basically hashes all Git refs together and stores that hash in the
`project_repository_states` table of the database.

The **secondary** node does the same to calculate the hash of its
clone, and compares the hash with the value the **primary** node
calculated. If there is a mismatch, Geo will mark this as a mismatch
and the administrator can see this in the [Geo admin panel](https://docs.gitlab.com/ee/user/admin_area/geo_nodes.html).
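
A toy version of this ref hashing (illustrative only; the real checksum implementation differs in detail):

```ruby
require 'digest'

# Illustrative sketch only: hash all "sha ref-name" lines together to get a
# single checksum for the repository. Sorting first makes the result
# independent of the order in which refs are listed.
def repository_checksum(ref_lines)
  digest = Digest::SHA1.new
  ref_lines.sort.each { |line| digest.update(line) }
  digest.hexdigest
end

# Hypothetical refs for the sketch:
refs = [
  'c1acaa58bbcbc3eafe538cb8274ba387047b69f8 refs/heads/master',
  '5937ac0a7beb003549fc5fd26fc247adbce4a52e refs/tags/v1.0.0'
]
repository_checksum(refs)
```

Two clones with identical refs produce identical checksums, which is what lets the **secondary** node compare its value against the one stored by the **primary** node.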

## Glossary

### Primary node

A **primary** node is the single node in a Geo setup that has read-write
capabilities. It's the single source of truth, and the Geo
**secondary** nodes replicate their data from there.

In a Geo setup, there can only be one **primary** node. All
**secondary** nodes connect to that **primary**.

### Secondary node

A **secondary** node is a read-only replica of the **primary** node
running in a different geographical location.

### Streaming replication

Geo depends on the streaming replication feature of PostgreSQL. It
completely replicates the database data and the database schema. The
database replica is a read-only copy.

Streaming replication depends on the Write Ahead Logs, or WAL. Those
logs are copied over to the replica and replayed there.

Since streaming replication also replicates the schema, database
migrations do not need to run on the secondary nodes.

### Tracking database

A database on each Geo **secondary** node that keeps state for the node
on which it resides. Read more in [Using the Tracking Database](#using-the-tracking-database).

### FDW

Foreign Data Wrapper, or FDW, is a feature built into PostgreSQL. It
allows data to be queried from different data sources. In Geo, it's
used to query data from different PostgreSQL instances.

## Geo Event Log

The Geo **primary** node stores events in the `geo_event_log` table. Each
entry in the log contains a specific type of event. These types of
events include:

- Repository Deleted event
- Repository Renamed event
- Repositories Changed event
- Repository Created event
- Hashed Storage Migrated event
- Lfs Object Deleted event
- Hashed Storage Attachments event
- Job Artifact Deleted event
- Upload Deleted event

### Geo Log Cursor

The process running on the **secondary** node that looks for new
`Geo::EventLog` rows.

## Code features

### `Gitlab::Geo` utilities

Small utility methods related to Geo go into the
[`ee/lib/gitlab/geo.rb`](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/gitlab/geo.rb)
file.

Many of these methods are cached using the `RequestStore` class to
reduce the performance impact of using the methods throughout the
codebase.

#### Current node

The class method `.current_node` returns the `GeoNode` record for the
current node.

We use the `host`, `port`, and `relative_url_root` values from
`gitlab.yml` and search in the database to identify which node we are
in (see `GeoNode.current_node`).
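
In outline, the lookup behaves like this simplified sketch (the real implementation is a database query, and the attribute names here are taken from the description above):

```ruby
# Illustrative sketch only: pick the GeoNode whose connection details match
# the values this instance reads from gitlab.yml.
GeoNode = Struct.new(:host, :port, :relative_url_root, keyword_init: true)

def current_node(nodes, config)
  nodes.find do |node|
    node.host == config[:host] &&
      node.port == config[:port] &&
      node.relative_url_root == config[:relative_url_root]
  end
end

nodes = [
  GeoNode.new(host: 'primary.example.com', port: 443, relative_url_root: ''),
  GeoNode.new(host: 'secondary.example.com', port: 443, relative_url_root: '')
]
current_node(nodes, host: 'secondary.example.com', port: 443, relative_url_root: '')
# => nodes[1] (the secondary)
```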
|
||||||
|
|
||||||
|
#### Primary or secondary
|
||||||
|
|
||||||
|
To determine whether the current node is a **primary** node or a
|
||||||
|
**secondary** node use the `.primary?` and `.secondary?` class
|
||||||
|
methods.
|
||||||
|
|
||||||
|
It is possible for these methods to both return `false` on a node when
|
||||||
|
the node is not enabled. See [Enablement](#enablement).
|
||||||
|
|
||||||
|
#### Geo Database configured?
|
||||||
|
|
||||||
|
There is also an additional gotcha when dealing with things that
|
||||||
|
happen during initialization time. In a few places, we use the
|
||||||
|
`Gitlab::Geo.geo_database_configured?` method to check if the node has
|
||||||
|
the tracking database, which only exists on the **secondary**
|
||||||
|
node. This overcomes race conditions that could happen during
|
||||||
|
bootstrapping of a new node.
|
||||||
|
|
||||||
|
#### Enablement
|
||||||
|
|
||||||
|
We consider Geo feature enabled when the user has a valid license with the
|
||||||
|
feature included, and they have at least one node defined at the Geo Nodes
|
||||||
|
screen.
|
||||||
|
|
||||||
|
See `Gitlab::Geo.enabled?` and `Gitlab::Geo.license_allows?` methods.
|
||||||
|
|
||||||
|
#### Read-only
|
||||||
|
|
||||||
|
All Geo **secondary** nodes are read-only.
|
||||||
|
|
||||||
|
The general principle of a [read-only database](verifying_database_capabilities.md#read-only-database)
|
||||||
|
applies to all Geo **secondary** nodes. So the
|
||||||
|
`Gitlab::Database.read_only?` method will always return `true` on a
|
||||||
|
**secondary** node.
|
||||||
|
|
||||||
|
When some write actions are not allowed because the node is a
|
||||||
|
**secondary**, consider adding the `Gitlab::Database.read_only?` or
|
||||||
|
`Gitlab::Database.read_write?` guard, instead of `Gitlab::Geo.secondary?`.
|
||||||
|
|
||||||
|
The database itself will already be read-only in a replicated setup,
|
||||||
|
so we don't need to take any extra step for that.

## History of communication channel

The communication channel has changed since the first iteration. Here you can
review the historic decisions and why we moved to new implementations.

### Custom code (GitLab 8.6 and earlier)

In GitLab versions before 8.6, custom code was used to handle
notification from the **primary** node to **secondary** nodes via HTTP
requests.

### System hooks (GitLab 8.7 to 9.5)

Later, it was decided to move away from custom code and begin using
system hooks. More people were using them, so
many would benefit from improvements made to this communication layer.

There is a specific **internal** endpoint in our API code (Grape)
that receives all requests from these system hooks:
`/api/v4/geo/receive_events`.

We filter and dispatch each event based on the `event_name` field.

### Geo Log Cursor (GitLab 10.0 and up)

Since GitLab 10.0, [system hooks](#system-hooks-gitlab-87-to-95) are no longer
used and the Geo Log Cursor is used instead. The Log Cursor traverses the
`Geo::EventLog` rows to see if there are changes since the last time
the log was checked, and handles repository updates, deletes,
changes, and renames.

The table is within the replicated database. This has two advantages over the
old method:

- Replication is synchronous and we preserve the order of events.
- Replication of the events happens at the same time as the changes in the
  database.
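
Conceptually, the Log Cursor's traversal can be sketched as follows (a simplification, not the actual `Gitlab::Geo::LogCursor` code; the event log is an in-memory array here):

```ruby
# Simplified log cursor: remember the ID of the last processed event and
# handle, in order, everything that arrived since.
EVENT_LOG = [
  { id: 1, event: :repository_created },
  { id: 2, event: :repository_updated },
  { id: 3, event: :repository_renamed }
].freeze

def handle(row)
  # In GitLab this dispatches to repository update/delete/rename handlers.
  "handled #{row[:event]}"
end

def process_new_events(last_processed_id)
  new_events = EVENT_LOG.select { |row| row[:id] > last_processed_id }
  new_events.each { |row| handle(row) }
  new_events.empty? ? last_processed_id : new_events.last[:id]
end

process_new_events(1) # => 3 (events 2 and 3 were handled)
process_new_events(3) # => 3 (nothing new to do)
```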

@@ -93,7 +93,7 @@ become available, you will be able to share job templates like this

Dependencies should be kept to the minimum. The introduction of a new
dependency should be argued in the merge request, as per our [Approval
Guidelines](../code_review.md#approval-guidelines). Both [License
Management](https://docs.gitlab.com/ee/user/project/merge_requests/license_management.html)
**[ULTIMATE]** and [Dependency
Scanning](https://docs.gitlab.com/ee/user/project/merge_requests/dependency_scanning.html)

@@ -0,0 +1,37 @@

# Licensed feature availability **[STARTER]**

As of GitLab 9.4, we support a simplified version of licensed
feature availability checks via `ee/app/models/license.rb`, both for
on-premise and GitLab.com plans and features.

## Restricting features scoped by namespaces or projects

GitLab.com plans are persisted on user groups and namespaces. Therefore, if you're adding a
feature such as [Related issues](https://docs.gitlab.com/ee/user/project/issues/related_issues.html) or
[Service desk](https://docs.gitlab.com/ee/user/project/service_desk.html),
it should be restricted on namespace scope.

1. Add the feature symbol to the `EES_FEATURES`, `EEP_FEATURES` or `EEU_FEATURES` constants in
   `ee/app/models/license.rb`. Note in `ee/app/models/ee/namespace.rb` that the _Bronze_ GitLab.com
   plan maps to on-premise _EES_, _Silver_ to _EEP_, and _Gold_ to _EEU_.
2. Check using:

   ```ruby
   project.feature_available?(:feature_symbol)
   ```
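
The mapping described in step 1 can be pictured like this (a hedged sketch; the constant names mirror `ee/app/models/license.rb`, but the feature symbols and plan resolution are simplified and made up):

```ruby
EES_FEATURES = [:related_issues].freeze                # also Bronze on GitLab.com
EEP_FEATURES = (EES_FEATURES + [:service_desk]).freeze # also Silver
EEU_FEATURES = (EEP_FEATURES + [:epics]).freeze        # also Gold

PLAN_FEATURES = {
  'bronze' => EES_FEATURES,
  'silver' => EEP_FEATURES,
  'gold'   => EEU_FEATURES
}.freeze

# Simplified namespace-scoped check: resolve the plan of the project's
# namespace, then ask whether that plan includes the feature.
def feature_available?(plan, feature)
  PLAN_FEATURES.fetch(plan, []).include?(feature)
end

feature_available?('bronze', :service_desk) # => false
feature_available?('silver', :service_desk) # => true
```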

## Restricting global features (instance)

However, for features such as [Geo](https://docs.gitlab.com/ee/administration/geo/replication/index.html) and
[Load balancing](https://docs.gitlab.com/ee/administration/database_load_balancing.html), which cannot be restricted
to only a subset of projects or namespaces, the check is made directly against
the instance license.

1. Add the feature symbol to the `EES_FEATURES`, `EEP_FEATURES` or `EEU_FEATURES` constants in
   `ee/app/models/license.rb`.
2. Add the same feature symbol to `GLOBAL_FEATURES`.
3. Check using:

   ```ruby
   License.feature_available?(:feature_symbol)
   ```
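
Analogously, the instance-wide check can be sketched like this (illustrative; `GLOBAL_FEATURES` and `License.feature_available?` are the real names, while the license contents here are made up):

```ruby
# Features that only make sense instance-wide.
GLOBAL_FEATURES = [:geo, :db_load_balancing].freeze

class License
  # Pretend the instance license includes these features.
  def self.features
    [:geo, :related_issues]
  end

  def self.feature_available?(feature)
    features.include?(feature)
  end
end

License.feature_available?(:geo)               # => true
License.feature_available?(:db_load_balancing) # => false
```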

@@ -0,0 +1,68 @@

# Packages **[PREMIUM]**

This document guides you through adding support for another [package management system](https://docs.gitlab.com/ee/administration/packages.html) to GitLab.

See the already supported package types in the [Packages documentation](https://docs.gitlab.com/ee/administration/packages.html).

Since the GitLab packages UI is pretty generic, it is possible to add support for a new
package system with backend changes only. This guide is superficial and does
not cover the way the code should be written. However, you can find good examples
by looking at the existing merge requests with Maven and NPM support:

- [NPM registry support](https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/8673).
- [Maven repository](https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/6607).
- [Instance level endpoint for Maven repository](https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/8757).

## General information

The existing database model requires the following:

- Every package belongs to a project.
- Every package file belongs to a package.
- A package can have one or more package files.
- The package model is based on storing information about the package and its version.

## API endpoints

Package systems work with GitLab via the API. For example, `ee/lib/api/npm_packages.rb`
implements the API endpoints to work with NPM clients. So, the first thing to do is to
add a new `ee/lib/api/your_name_packages.rb` file with the API endpoints that are
necessary to make the package system client work. Usually that means having
endpoints like:

- GET package information.
- GET package file content.
- PUT upload package.

Since the packages belong to a project, it's expected to have project-level endpoints
for uploading and downloading them. For example:

```
GET https://gitlab.com/api/v4/projects/<your_project_id>/packages/npm/
PUT https://gitlab.com/api/v4/projects/<your_project_id>/packages/npm/
```
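
To make the endpoint shapes concrete, here is a self-contained sketch of the three operations backed by an in-memory store (purely illustrative; the real implementation uses Grape in `ee/lib/api/`, with authorization and object storage, and "acme" is a made-up package type):

```ruby
# In-memory sketch of the three endpoint shapes for a hypothetical
# "acme" package system.
PACKAGES = {}

# PUT upload package.
def upload_package(project_id, name, file_name, content)
  package = (PACKAGES[[project_id, name]] ||= { name: name, files: {} })
  package[:files][file_name] = content
  :created
end

# GET package information.
def package_info(project_id, name)
  PACKAGES[[project_id, name]] || :not_found
end

# GET package file content.
def package_file(project_id, name, file_name)
  package = PACKAGES[[project_id, name]]
  (package && package[:files][file_name]) || :not_found
end

upload_package(42, 'acme-lib', 'acme-lib-1.0.tgz', 'bytes')
package_info(42, 'acme-lib')[:name]         # => "acme-lib"
package_file(42, 'acme-lib', 'missing.tgz') # => :not_found
```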

Group-level and instance-level endpoints are good to have but are optional.

NOTE: **Note:**
To avoid name conflicts for instance-level endpoints we use
[the package naming convention](https://docs.gitlab.com/ee/user/project/packages/npm_registry.html#package-naming-convention).

## Configuration

GitLab has a `packages` section in its configuration file (`gitlab.rb`).
It applies to all package systems supported by GitLab. Usually you don't need
to add anything there.

Packages can be configured to use object storage, therefore your code must support it.

## Database

The current database model allows you to store a name and a version for each package.
Every time you upload a new package, you can either create a new record of `Package`
or add files to an existing record. `PackageFile` should be able to store all file-related
information like the file `name`, `size`, `sha1`, etc.

If there is specific data that needs to be stored for only one package system,
consider creating a separate metadata model. See the `packages_maven_metadata` table
and the `Packages::MavenMetadatum` model as an example of package-specific data.
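
The model shape described above can be sketched with plain Ruby structs (illustrative only; the real classes are ActiveRecord models, and the package name and content here are made up):

```ruby
require 'digest'

# One package has many files; field names follow the description above.
Package     = Struct.new(:project_id, :name, :version, :files)
PackageFile = Struct.new(:file_name, :size, :sha1)

content = 'fake package bytes'
file = PackageFile.new('acme-lib-1.0.tgz',
                       content.bytesize,
                       Digest::SHA1.hexdigest(content))
package = Package.new(42, 'acme-lib', '1.0', [file])

package.files.first.file_name # => "acme-lib-1.0.tgz"
package.version               # => "1.0"
```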

@@ -28,6 +28,24 @@ bin/rake "gitlab:seed:issues[group-path/project-path]"

By default, this seeds an average of 2 issues per week for the last 5 weeks per
project.

#### Seeding issues for Insights charts **[ULTIMATE]**

You can seed issues specifically for working with the
[Insights charts](https://docs.gitlab.com/ee/user/group/insights/index.html) with the
`gitlab:seed:insights:issues` task:

```shell
# All projects
bin/rake gitlab:seed:insights:issues

# A specific project
bin/rake "gitlab:seed:insights:issues[group-path/project-path]"
```

By default, this seeds an average of 10 issues per week for the last 52 weeks
per project. All issues will also be randomly labeled with team, type, severity,
and priority.

### Automation

If you're very sure that you want to **wipe the current database** and refill