this adds `gitlab-elasticsearch-indexer` to `$GOPATH/bin`, please make sure that is in your `$PATH`. After that GitLab will find it and you'll be able to enable it in the admin settings area.
**note:** `make` will not recompile the executable unless you do `make clean` beforehand
## Helpful rake tasks
-`gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index.
-`gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size.
Additionally, if you need large repos or multiple forks for testing, please consider [following these instructions](rake_tasks.md#extra-project-seed-options)
The ElasticSearch integration depends on an external indexer. We ship a [ruby indexer](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/bin/elastic_repo_indexer) by default but are also working on an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a rake task, but after this is done GitLab itself will trigger reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [/ee/app/models/concerns/elastic/application_search.rb](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/app/models/concerns/elastic/application_search.rb).
All indexing after the initial one is done via `ElasticIndexerWorker` (sidekiq jobs).
Search queries are generated by the concerns found in [ee/app/models/concerns/elastic](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have been a historic source of security bugs so please pay close attention to them!
## Existing Analyzers/Tokenizers/Filters
These are all defined in https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/elasticsearch/git/model.rb
### Analyzers
#### `path_analyzer`
Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.
Please see the `path_tokenizer` explanation below for an example.
#### `sha_analyzer`
Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.
Please see the `sha_tokenizer` explanation later below for an example.
#### `code_analyzer`
Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: `code`, `edgeNGram_filter`, `lowercase`, and `asciifolding`
The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.
Please see the `code` filter for an explanation on how tokens are split.
#### `code_search_analyzer`
Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.
### Tokenizers
#### `sha_tokenizer`
This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searcheable by any sub-set of it (minimum of 5 chars).
example:
`240c29dc7e` becomes:
-`240c2`
-`240c29`
-`240c29d`
-`240c29dc`
-`240c29dc7`
-`240c29dc7e`
#### `path_tokenizer`
This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.
example:
`'/some/path/application.js'` becomes:
-`'/some/path/application.js'`
-`'some/path/application.js'`
-`'path/application.js'`
-`'application.js'`
### Filters
#### `code`
Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.
Patterns:
-`"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowedCameCased strings as separate tokens
-`'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
-`"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
-`'\.([^.]+)(?=\.|\s|\Z)'`: separate terms with periods in-between
-`'\/?([^\/]+)(?=\/|\b)'`: separate path terms `like/this/one`
#### `edgeNGram_filter`
Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenfilter.html) to allow inputs with only parts of a token to find the token. For example it would turn `glasses` into permutations starting with `gl` and ending with `glasses`, which would allow a search for "`glass`" to find the original token `glasses`
## Gotchas
- Searches can have their own analyzers. Remember to check when editing analyzers
-`Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches
## Troubleshooting
### Getting "flood stage disk watermark [95%] exceeded"