
# SQL Query Guidelines

This document describes various guidelines to follow when writing SQL queries,
either using ActiveRecord/Arel or raw SQL queries.

## Using LIKE Statements

The most common way to search for data is using the `LIKE` statement. For
example, to get all issues with a title starting with "WIP:" you'd write the
following query:

```sql
SELECT *
FROM issues
WHERE title LIKE 'WIP:%';
```

On PostgreSQL the `LIKE` statement is case-sensitive. To perform a case-insensitive
`LIKE` you have to use `ILIKE` instead.
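
The difference between the two operators can be illustrated outside the database. The following is a minimal sketch in plain Ruby, assuming a hypothetical helper named `like_to_regexp` that translates a pattern into a `Regexp`. It only handles the `%` and `_` wildcards and ignores escape characters, so it approximates the matching rules rather than reproducing PostgreSQL's behaviour exactly.

```ruby
# Hypothetical helper: translate a LIKE pattern into an anchored Regexp.
# `%` matches any run of characters and `_` matches exactly one character.
# Escape sequences are deliberately not handled in this sketch.
def like_to_regexp(pattern, case_insensitive: false)
  body = pattern.each_char.map do |char|
    case char
    when '%' then '.*'
    when '_' then '.'
    else Regexp.escape(char)
    end
  end.join

  # MULTILINE lets `.` match newlines, as a wildcard would.
  flags = Regexp::MULTILINE
  flags |= Regexp::IGNORECASE if case_insensitive

  Regexp.new("\\A#{body}\\z", flags)
end

like  = like_to_regexp('WIP:%')                         # behaves like LIKE
ilike = like_to_regexp('WIP:%', case_insensitive: true) # behaves like ILIKE

puts like.match?('WIP: fix login')  # matches, same case
puts like.match?('wip: fix login')  # does not match, case differs
puts ilike.match?('wip: fix login') # matches, case is ignored
```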

To handle case sensitivity automatically, build `LIKE` queries with Arel instead
of raw SQL fragments, as Arel automatically uses `ILIKE` on PostgreSQL. Take
this raw SQL fragment:

```ruby
Issue.where('title LIKE ?', 'WIP:%')
```

You'd write this instead:

```ruby
Issue.where(Issue.arel_table[:title].matches('WIP:%'))
```

Here `matches` generates the correct `LIKE` / `ILIKE` statement depending on the
database being used.

If you need to chain multiple `OR` conditions you can also do this using Arel:

```ruby
table = Issue.arel_table

Issue.where(table[:title].matches('WIP:%').or(table[:foo].matches('WIP:%')))
```

On PostgreSQL, this produces:

```sql
SELECT *
FROM issues
WHERE (title ILIKE 'WIP:%' OR foo ILIKE 'WIP:%')
```

## LIKE & Indexes

PostgreSQL won't use any indexes when using `LIKE` / `ILIKE` with a wildcard at
the start. For example, this will not use any indexes:

```sql
SELECT *
FROM issues
WHERE title ILIKE '%WIP:%';
```

Because the value for `ILIKE` starts with a wildcard, the database can't use a
regular (B-tree) index: it doesn't know where in the index to start scanning.

Luckily, PostgreSQL _does_ provide a solution: trigram GIN indexes. These
indexes can be created as follows:

```sql
CREATE INDEX [CONCURRENTLY] index_name_here
ON table_name
USING GIN(column_name gin_trgm_ops);
```

The key here is the `GIN(column_name gin_trgm_ops)` part. This creates a [GIN
index](https://www.postgresql.org/docs/current/gin.html) with the operator class set to `gin_trgm_ops`. These indexes
_can_ be used by `ILIKE` / `LIKE` and can lead to greatly improved performance.
One downside of these indexes is that they can easily get quite large (depending
on the amount of data indexed).
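
To give an intuition for why such an index helps and why it grows quickly, here is a rough sketch of how a value is split into trigrams. It is a simplified approximation for ASCII text, not the extension's actual code, and the `trigrams` helper is a hypothetical name. It assumes the documented behaviour of lowercasing the text, treating non-alphanumeric characters as word separators, and padding each word with two leading spaces and one trailing space.

```ruby
require 'set'

# Hypothetical helper: approximate the set of trigrams stored for a value.
def trigrams(text)
  text.downcase.scan(/[a-z0-9]+/).flat_map do |word|
    padded = "  #{word} " # two leading spaces, one trailing space
    (0..padded.length - 3).map { |i| padded[i, 3] }
  end.to_set
end

title = 'WIP: Fix login'

# Every word contributes several three-character entries to the index,
# which is why these indexes can become large.
p trigrams(title)

# A search term found in the middle of the title shares trigrams with it,
# so candidate rows can be located without scanning the whole table.
p trigrams(title).include?('log')
```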

To keep naming of these indexes consistent please use the following naming
pattern:

```
index_TABLE_on_COLUMN_trigram
```

For example, a GIN/trigram index for `issues.title` would be called
`index_issues_on_title_trigram`.
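
As an illustration, the naming pattern and the `CREATE INDEX` statement shown earlier can be generated from a table and column name. The helper names `trigram_index_name` and `create_trigram_index_sql` are hypothetical and are not part of GitLab. The sketch interpolates identifiers without quoting, so it is only suitable for trusted, hard-coded names.

```ruby
# Hypothetical helper: build an index name following index_TABLE_on_COLUMN_trigram.
def trigram_index_name(table, column)
  "index_#{table}_on_#{column}_trigram"
end

# Hypothetical helper: build the statement that creates the trigram GIN index.
# Identifiers are not quoted or validated here; pass trusted names only.
def create_trigram_index_sql(table, column)
  "CREATE INDEX CONCURRENTLY #{trigram_index_name(table, column)} " \
    "ON #{table} USING GIN (#{column} gin_trgm_ops);"
end

puts trigram_index_name(:issues, :title)
puts create_trigram_index_sql(:issues, :title)
```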

Because these indexes can take quite some time to build, they should be built
concurrently. This can be done by using `CREATE INDEX CONCURRENTLY` instead of
just `CREATE INDEX`. Concurrent indexes can _not_ be created inside a
transaction. Transactions for migrations can be disabled using the following
pattern:

```ruby
class MigrationName < ActiveRecord::Migration[4.2]
  disable_ddl_transaction!
end
```

For example:

```ruby
class AddUsersLowerUsernameEmailIndexes < ActiveRecord::Migration[4.2]
  disable_ddl_transaction!

  def up
    return unless Gitlab::Database.postgresql?

    execute 'CREATE INDEX CONCURRENTLY index_on_users_lower_username ON users (LOWER(username));'
    execute 'CREATE INDEX CONCURRENTLY index_on_users_lower_email ON users (LOWER(email));'
  end

  def down
    return unless Gitlab::Database.postgresql?

    # Pass the index name explicitly; a bare symbol is treated as a column name.
    remove_index :users, name: 'index_on_users_lower_username'
    remove_index :users, name: 'index_on_users_lower_email'
  end
end
```

## Plucking IDs

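This can't be stressed enough: **never** use ActiveRecord's `pluck` to pluck a
set of values into memory only to use them as an argument for another query. For
example, this loads every project ID into Ruby and then sends all of them back
to the database in a single query, which will make the database **very** sad: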

```ruby
projects = Project.all.pluck(:id)

MergeRequest.where(source_project_id: projects)
```

Instead you can just use sub-queries which perform far better:

```ruby
MergeRequest.where(source_project_id: Project.all.select(:id))
```

The _only_ time you should use `pluck` is when you actually need to operate on
the values in Ruby itself (e.g. write them to a file). In almost all other cases
you should ask yourself "Can I not just use a sub-query?".
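
To make the cost concrete, the sketch below compares roughly what the database receives in each case. The SQL strings are simplified (ActiveRecord quotes identifiers and may format the query differently), and `plucked_query` and `SUBQUERY` are hypothetical names used only for this illustration.

```ruby
# Roughly the query sent when the IDs were plucked into Ruby first:
# every ID is written into the statement.
def plucked_query(ids)
  "SELECT * FROM merge_requests WHERE source_project_id IN (#{ids.join(', ')})"
end

# Roughly the query sent when a sub-query is used: its size stays constant
# no matter how many projects exist.
SUBQUERY = 'SELECT * FROM merge_requests ' \
           'WHERE source_project_id IN (SELECT id FROM projects)'

ids = (1..100_000).to_a # stand-in for the result of Project.all.pluck(:id)

puts "plucked:   #{plucked_query(ids).bytesize} bytes"
puts "sub-query: #{SUBQUERY.bytesize} bytes"
```

On top of the larger statement, the plucked version needs an extra round trip to fetch the IDs and holds all of them in Ruby memory.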

In line with our `CodeReuse/ActiveRecord` cop, you should only use forms like
`pluck(:id)` or `pluck(:user_id)` within model code. In the former case, you can
use the `ApplicationRecord`-provided `.pluck_primary_key` helper method instead.
In the latter, you should add a small helper method to the relevant model.

## Inherit from ApplicationRecord

Most models in the GitLab codebase should inherit from `ApplicationRecord`,
rather than from `ActiveRecord::Base`. This allows helper methods to be easily
added.

An exception to this rule exists for models created in database migrations. As
these should be isolated from application code, they should continue to subclass
from `ActiveRecord::Base`.

## Use UNIONs

UNIONs aren't very commonly used in most Rails applications but they're very
powerful and useful. In most applications queries tend to use a lot of JOINs to
get related data or data based on certain criteria, but JOIN performance can
quickly deteriorate as the data involved grows.

For example, if you want to get a list of projects where the name contains a
value _or_ the name of the namespace contains a value most people would write
the following query:

```sql
SELECT *
FROM projects
JOIN namespaces ON namespaces.id = projects.namespace_id
WHERE projects.name ILIKE '%gitlab%'
OR namespaces.name ILIKE '%gitlab%';
```

Using a large database this query can easily take around 800 milliseconds to
run. Using a UNION we'd write the following instead:

```sql
SELECT projects.*
FROM projects
WHERE projects.name ILIKE '%gitlab%'

UNION

SELECT projects.*
FROM projects
JOIN namespaces ON namespaces.id = projects.namespace_id
WHERE namespaces.name ILIKE '%gitlab%';
```

This query in turn only takes around 15 milliseconds to complete while returning
the same projects.

This doesn't mean you should start using UNIONs everywhere, but it's something
to keep in mind when using lots of JOINs in a query and filtering out records
based on the joined data.

GitLab comes with a `Gitlab::SQL::Union` class that can be used to build a UNION
of multiple `ActiveRecord::Relation` objects. You can use this class as
follows:

```ruby
union = Gitlab::SQL::Union.new([projects, more_projects, ...])

Project.from("(#{union.to_sql}) projects")
```
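
Conceptually, such a class only needs the SQL of each relation. The following is a minimal sketch of the idea in plain Ruby. It is not the actual `Gitlab::SQL::Union` implementation: `SketchUnion` and `FakeRelation` are hypothetical names, and `FakeRelation` stands in for any object that responds to `to_sql`.

```ruby
# Hypothetical stand-in for an ActiveRecord::Relation.
FakeRelation = Struct.new(:to_sql)

# Illustrative only: wraps each query in parentheses and joins them with UNION.
class SketchUnion
  def initialize(relations)
    @relations = relations
  end

  def to_sql
    @relations.map { |relation| "(#{relation.to_sql})" }.join("\nUNION\n")
  end
end

by_name = FakeRelation.new(
  "SELECT projects.* FROM projects WHERE projects.name ILIKE '%gitlab%'"
)
by_namespace = FakeRelation.new(
  'SELECT projects.* FROM projects ' \
  'JOIN namespaces ON namespaces.id = projects.namespace_id ' \
  "WHERE namespaces.name ILIKE '%gitlab%'"
)

union = SketchUnion.new([by_name, by_namespace])

# The combined SQL is used as a derived table aliased to the model's table name.
puts "SELECT projects.* FROM (#{union.to_sql}) projects"
```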

## Ordering by Creation Date

When ordering records by the time they were created, you can simply order by
the `id` column instead of ordering by `created_at`. Because IDs are always
unique and incremented in the order that rows are created, this produces the
same ordering. This also means there's no need to add an index on
`created_at` to ensure consistent performance, as `id` is already indexed by
default.

## Use WHERE EXISTS instead of WHERE IN

While `WHERE IN` and `WHERE EXISTS` can be used to produce the same data it is
recommended to use `WHERE EXISTS` whenever possible. While in many cases
PostgreSQL can optimise `WHERE IN` quite well there are also many cases where
`WHERE EXISTS` will perform (much) better.

In Rails you have to build such a query from SQL fragments:

```ruby
Project.where('EXISTS (?)', User.select(1).where('projects.creator_id = users.id AND users.foo = X'))
```

This would then produce a query along the lines of the following:

```sql
SELECT *
FROM projects
WHERE EXISTS (
    SELECT 1
    FROM users
    WHERE projects.creator_id = users.id
    AND users.foo = X
)
```

## `.find_or_create_by` is not atomic

Methods like `.find_or_create_by`, `.first_or_create`, and similar are
inherently not atomic: they first run a `SELECT`, and only if there are no
results do they perform an `INSERT`. With concurrent processes this creates a
race condition in which two processes may both try to insert a similar
record. This may not be desired, or may cause one of the queries to fail due
to a constraint violation, for example.

Using transactions does not solve this problem.

To solve this we've added the `ApplicationRecord.safe_find_or_create_by` method.

This method can be used just like the normal `find_or_create_by`, but it wraps
the call in a *new* transaction and retries if the call fails because of an
`ActiveRecord::RecordNotUnique` error.

To be able to use this method, make sure the model you want to use
this on inherits from `ApplicationRecord`.
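
The race and the retry can be sketched in plain Ruby without a database. This is not GitLab's implementation: `FakeTable`, `RacyTable`, `RecordNotUnique`, and `safe_find_or_create` are hypothetical names, an in-memory hash stands in for a table with a unique constraint, and no transaction is involved.

```ruby
# Hypothetical stand-in for ActiveRecord::RecordNotUnique.
class RecordNotUnique < StandardError; end

# Minimal in-memory "table" with a unique constraint on the key.
class FakeTable
  def initialize
    @rows = {}
  end

  def find(key)
    @rows[key]
  end

  def insert(key)
    raise RecordNotUnique, "duplicate key: #{key}" if @rows.key?(key)

    @rows[key] = { key: key }
  end
end

# Simulates a competing process that inserts the row right after our first
# SELECT has already seen that it is missing.
class RacyTable < FakeTable
  def find(key)
    row = super
    return row if row || @raced

    @raced = true
    insert(key) # the "other process" wins the race
    nil         # our SELECT still reports no row
  end
end

# SELECT first, INSERT on a miss, and start over from the SELECT when the
# INSERT violates the unique constraint.
def safe_find_or_create(table, key, max_attempts: 3)
  attempts = 0

  begin
    attempts += 1
    table.find(key) || table.insert(key)
  rescue RecordNotUnique
    retry if attempts < max_attempts
    raise
  end
end

p safe_find_or_create(RacyTable.new, 'alice')
```

Without the `rescue` and `retry`, the first attempt on `RacyTable` would surface the uniqueness error to the caller, which is the failure mode described above.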