
# Elasticsearch knowledge **[STARTER ONLY]**

This page collects useful information for working with Elasticsearch.

Information on how to enable Elasticsearch and perform the initial indexing is kept in [the integration documentation](../integration/elasticsearch.md#enabling-elasticsearch).

## Initial installation on OS X

We recommend using the Docker image. After installing Docker, you can immediately spin up an instance with:

```
docker run --name elastic56 -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:5.6.12
```

Use `docker stop elastic56` and `docker start elastic56` to stop and start it.

### Installing on the host

We currently only support Elasticsearch [5.6 to 6.x](../integration/elasticsearch.md#version-requirements).

Version 5.6 is available on Homebrew and is the recommended version to use in order to test compatibility.

```
brew install elasticsearch@5.6
```

There is no need to install any plugins.

## New repo indexer (beta)

If you're interested in working with the new beta repo indexer, all you need to do is:

- `git clone git@gitlab.com:gitlab-org/gitlab-elasticsearch-indexer.git`
- `make`
- `make install`

This adds `gitlab-elasticsearch-indexer` to `$GOPATH/bin`, so please make sure that directory is in your `$PATH`. After that, GitLab will find it and you'll be able to enable it in the admin settings area.

**Note:** `make` will not recompile the executable unless you run `make clean` beforehand.

## Helpful rake tasks

- `gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index.
- `gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size.

Additionally, if you need large repos or multiple forks for testing, please consider [following these instructions](rake_tasks.md#extra-project-seed-options).

## How does it work?

The Elasticsearch integration depends on an external indexer. We ship a [Ruby indexer](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/bin/elastic_repo_indexer) by default but are also working on an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a rake task. After that, GitLab itself triggers reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [`/ee/app/models/concerns/elastic/application_search.rb`](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/app/models/concerns/elastic/application_search.rb).

All indexing after the initial one is done via `ElasticIndexerWorker` (Sidekiq jobs).

Search queries are generated by the concerns found in [`ee/app/models/concerns/elastic`](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control and have historically been a source of security bugs, so please pay close attention to them!

## Existing Analyzers/Tokenizers/Filters
These are all defined in [`ee/lib/elasticsearch/git/model.rb`](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/elasticsearch/git/model.rb).

### Analyzers
#### `path_analyzer`
Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.

Please see the `path_tokenizer` explanation below for an example.

#### `sha_analyzer`
Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.

Please see the `sha_tokenizer` explanation below for an example.

#### `code_analyzer`
Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters `code`, `edgeNGram_filter`, `lowercase`, and `asciifolding`.

The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.

Please see the `code` filter for an explanation of how tokens are split.

#### `code_search_analyzer`
Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.
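
As a rough illustration of what this analyzer does to a search input, the sketch below emulates the three steps in plain Python. It is not the Elasticsearch implementation: `ascii_fold` and `code_search_analyze` are hypothetical helper names, and Unicode NFKD normalization is only an approximation of the `asciifolding` filter, which covers more characters.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate `asciifolding` by dropping combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def code_search_analyze(query: str) -> list[str]:
    """Split on whitespace, then lowercase and fold each token."""
    return [ascii_fold(token.lower()) for token in query.split()]


print(code_search_analyze("Café Foo::Bar(4)"))
# ['cafe', 'foo::bar(4)']
```

Note that punctuation such as `::` and `(4)` stays inside the token, because only whitespace separates tokens.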

### Tokenizers
#### `sha_tokenizer`
This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow a SHA to be searchable by any of its prefixes (minimum of 5 characters).

Example:

`240c29dc7e` becomes:
- `240c2`
- `240c29`
- `240c29d`
- `240c29dc`
- `240c29dc7`
- `240c29dc7e`
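
The expansion above can be reproduced with a few lines of Python. This is a minimal sketch of edge n-gram generation, not the tokenizer itself; `edge_ngrams` is a hypothetical helper, `min_gram=5` comes from the description above, and `max_gram=40` (the length of a full SHA-1 hex digest) is an assumption.

```python
def edge_ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# max_gram=40 is an assumed value, not taken from the GitLab configuration.
for prefix in edge_ngrams("240c29dc7e", min_gram=5, max_gram=40):
    print(prefix)
```

An input shorter than `min_gram` produces no tokens at all, which is why very short SHA fragments cannot match.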

#### `path_tokenizer`
This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.

Example:

`'/some/path/application.js'` becomes:
- `'/some/path/application.js'`
- `'some/path/application.js'`
- `'path/application.js'`
- `'application.js'`
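
The same list can be derived by repeatedly dropping the leading path segment. The sketch below is an approximation of the reversed `path_hierarchy` behavior for simple paths; `reverse_path_tokens` is a hypothetical helper, and edge cases such as trailing delimiters may be handled differently by Elasticsearch.

```python
def reverse_path_tokens(path: str, delimiter: str = "/") -> list[str]:
    """Return the path followed by each suffix that starts after a delimiter."""
    tokens = [path]
    rest = path
    while delimiter in rest:
        # Drop everything up to and including the first delimiter.
        rest = rest.split(delimiter, 1)[1]
        if rest:
            tokens.append(rest)
    return tokens


print(reverse_path_tokens("/some/path/application.js"))
```

Because every suffix is indexed, a search for `path/application.js` or just `application.js` matches one of the stored tokens exactly.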

### Filters
#### `code`
Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves. 

Patterns:
- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowerCamelCased strings as separate tokens
- `"(\\d+)"`: extracts digits
- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCased strings recursively. Ex: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
- `'\.([^.]+)(?=\.|\s|\Z)'`: separates terms with periods in between
- `'\/?([^\/]+)(?=\/|\b)'`: separates path terms `like/this/one`
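
To get a feel for how the first three patterns behave, the sketch below applies ASCII-only stand-ins for them with Python's `re` module. It is an approximation under stated assumptions: Python's `re` does not support `\p{Lu}` or `\p{Ll}`, so `[A-Z]` and `[a-z]` are used instead, `capture_tokens` is a hypothetical helper, and the deduplication and token ordering are simplifications rather than the filter's actual behavior. Keeping the original token mirrors the filter's `preserve_original` option.

```python
import re

# ASCII stand-ins for three of the patterns listed above.
PATTERNS = [
    r"([a-z]+|[A-Z][a-z]+|[A-Z]+)",  # camel-case pieces
    r"(\d+)",                        # digits
    r"(?=([A-Z]+[a-zA-Z]+))",        # recursive CamelCase suffixes
]


def capture_tokens(token: str, preserve_original: bool = True) -> list[str]:
    """Collect the first capture group of every match of every pattern."""
    out = [token] if preserve_original else []
    for pattern in PATTERNS:
        for match in re.finditer(pattern, token):
            captured = match.group(1)
            if captured and captured not in out:
                out.append(captured)
    return out


print(capture_tokens("item42"))
# ['item42', 'item', '42']
print(capture_tokens("ThisIsATest"))
```

For `ThisIsATest`, the lookahead pattern contributes `IsATest`, `ATest`, and `Test`, matching the example given for that pattern above.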

#### `edgeNGram_filter`
Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenfilter.html) to allow inputs with only parts of a token to find the token. For example, it would turn `glasses` into a series of prefixes starting with `gl` and ending with `glasses`, which would allow a search for "`glass`" to find the original token `glasses`.
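
The matching behavior can be sketched as a set lookup: the index stores every prefix, and a search term matches when it equals one of them. In this sketch `edge_ngram_filter` is a hypothetical helper, `min_gram=2` follows from the `gl` example, and `max_gram=40` is an assumption.

```python
def edge_ngram_filter(token: str, min_gram: int = 2, max_gram: int = 40) -> list[str]:
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


indexed = set(edge_ngram_filter("glasses"))

print("glass" in indexed)   # True: "glass" is a stored prefix
print("lasses" in indexed)  # False: only prefixes are stored, not suffixes
```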

## Gotchas

- Searches can have their own analyzers. Remember to check them when editing analyzers.
- Character filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches.

## Troubleshooting

### Getting "flood stage disk watermark [95%] exceeded"

You might get an error such as:

```
[2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct] 
   flood stage disk watermark [95%] exceeded on 
   [pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%], 
   all indices on this node will be marked read-only
```

This happens because disk usage has exceeded the flood stage watermark, which defaults to 95%, so Elasticsearch considers the remaining disk space insufficient.

In addition, the `read_only_allow_delete` setting will be set to `true`, which blocks indexing, `forcemerge`, and similar operations. You can inspect the index settings with:

```
curl "http://localhost:9200/gitlab-development/_settings?pretty"
```
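
If you save the JSON returned by that request, a short script can list which indices currently carry the block. This is a sketch under assumptions: `blocked_indices` is a hypothetical helper, and the sample document below is a hand-written illustration of the settings response shape, not real output.

```python
import json


def blocked_indices(settings_response: dict) -> list[str]:
    """Return index names whose read_only_allow_delete block is set."""
    blocked = []
    for index_name, body in settings_response.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        value = str(blocks.get("read_only_allow_delete", "false")).lower()
        if value == "true":
            blocked.append(index_name)
    return sorted(blocked)


# Illustrative sample only; a real response contains many more settings.
sample = json.loads(
    '{"gitlab-development": {"settings": {"index": '
    '{"blocks": {"read_only_allow_delete": "true"}}}}}'
)
print(blocked_indices(sample))
# ['gitlab-development']
```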

Add this to your `elasticsearch.yml` file:

```
# turn off the disk allocator
cluster.routing.allocation.disk.threshold_enabled: false 
```

_or_

```
# set your own limits
cluster.routing.allocation.disk.threshold_enabled: true 
cluster.routing.allocation.disk.watermark.flood_stage: 5gb   # ES 6.x only
cluster.routing.allocation.disk.watermark.low: 15gb 
cluster.routing.allocation.disk.watermark.high: 10gb
```
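
When the watermarks are given as absolute sizes like these, they describe the minimum free space rather than a usage percentage, which is why `low` is the largest value. The sketch below classifies a free-space figure against the example limits; `watermark_status` is a hypothetical helper, and the handling of exact boundary values is an assumption.

```python
GB = 1024 ** 3


def watermark_status(
    free_bytes: float,
    low: int = 15 * GB,
    high: int = 10 * GB,
    flood_stage: int = 5 * GB,
) -> str:
    """Return the most severe watermark that the free space falls below."""
    if free_bytes < flood_stage:
        return "flood_stage"
    if free_bytes < high:
        return "high"
    if free_bytes < low:
        return "low"
    return "ok"


# The log above reported 56.2gb free, which is only 3% of that disk.
print(watermark_status(56.2 * GB))  # ok
print(watermark_status(4 * GB))     # flood_stage
```

With the default percentage thresholds, 3% free space trips the flood stage, while the absolute limits above leave the same disk well within bounds.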

Restart Elasticsearch, and the `read_only_allow_delete` setting will clear on its own.

_from "Disk-based Shard Allocation | Elasticsearch Reference" [5.6](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html#disk-allocator) and [6.x](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/disk-allocator.html)_