Merge branch 'document-when-to-add-indexes' into 'master'
Added documentation on adding database indexes ## What does this MR do? This MR adds a guide on when to add database indexes, how to approach this problem, etc ## Are there points in the code the reviewer needs to double check? Spelling, grammar, etc ## Why was this MR needed? Blindly adding indexes can be harmful to the database, and I recently had to remove 48-something unused indexes. ## What are the relevant issue numbers? #20767 ## Does this MR meet the acceptance criteria? - [x] [Documentation created/updated](https://gitlab.com/gitlab-org/gitlab-ce/blob/master/doc/development/doc_styleguide.md) - Tests - [x] All builds are passing - [ ] Conform by the [style guides](https://gitlab.com/gitlab-org/gitlab-ce/blob/master/CONTRIBUTING.md#style-guides) - [x] Branch has no merge conflicts with `master` (if you do - rebase it please) - [x] [Squashed related commits together](https://git-scm.com/book/en/Git-Tools-Rewriting-History#Squashing-Commits) See merge request !5771
This commit is contained in:
commit
07d5759408
2 changed files with 127 additions and 0 deletions
|
@ -30,7 +30,11 @@
|
|||
- [Rake tasks](rake_tasks.md) for development
|
||||
- [Shell commands](shell_commands.md) in the GitLab codebase
|
||||
- [Sidekiq debugging](sidekiq_debugging.md)
|
||||
|
||||
## Databases
|
||||
|
||||
- [What requires downtime?](what_requires_downtime.md)
|
||||
- [Adding database indexes](adding_database_indexes.md)
|
||||
|
||||
## Compliance
|
||||
|
||||
|
|
123
doc/development/adding_database_indexes.md
Normal file
123
doc/development/adding_database_indexes.md
Normal file
|
@ -0,0 +1,123 @@
|
|||
# Adding Database Indexes
|
||||
|
||||
Indexes can be used to speed up database queries, but when should you add a new
|
||||
index? Traditionally the answer to this question has been to add an index for
|
||||
every column used for filtering or joining data. For example, consider the
|
||||
following query:
|
||||
|
||||
```sql
|
||||
SELECT *
|
||||
FROM projects
|
||||
WHERE user_id = 2;
|
||||
```
|
||||
|
||||
Here we are filtering by the `user_id` column and as such a developer may decide
|
||||
to index this column.
|
||||
|
||||
While in certain cases indexing columns using the above approach may make sense
|
||||
it can actually have a negative impact. Whenever you write data to a table any
|
||||
existing indexes need to be updated. The more indexes there are the slower this
|
||||
can potentially become. Indexes can also take up quite some disk space depending
|
||||
on the amount of data indexed and the index type. For example, PostgreSQL offers
|
||||
"GIN" indexes which can be used to index certain data types that can not be
|
||||
indexed by regular btree indexes. These indexes however generally take up more
|
||||
data and are slower to update compared to btree indexes.
|
||||
|
||||
Because of all this one should not blindly add a new index for every column used
|
||||
to filter data by. Instead one should ask themselves the following questions:
|
||||
|
||||
1. Can I write my query in such a way that it re-uses as many existing indexes
|
||||
as possible?
|
||||
2. Is the data going to be large enough that using an index will actually be
|
||||
faster than just iterating over the rows in the table?
|
||||
3. Is the overhead of maintaining the index worth the reduction in query
|
||||
timings?
|
||||
|
||||
We'll explore every question in detail below.
|
||||
|
||||
## Re-using Queries
|
||||
|
||||
The first step is to make sure your query re-uses as many existing indexes as
|
||||
possible. For example, consider the following query:
|
||||
|
||||
```sql
|
||||
SELECT *
|
||||
FROM todos
|
||||
WHERE user_id = 123
|
||||
AND state = 'open';
|
||||
```
|
||||
|
||||
Now imagine we already have an index on the `user_id` column but not on the
|
||||
`state` column. One may think this query will perform badly due to `state` being
|
||||
unindexed. In reality the query may perform just fine given the index on
|
||||
`user_id` can filter out enough rows.
|
||||
|
||||
The best way to determine if indexes are re-used is to run your query using
|
||||
`EXPLAIN ANALYZE`. Depending on any extra tables that may be joined and
|
||||
other columns being used for filtering you may find an extra index is not going
|
||||
to make much (if any) difference. On the other hand you may determine that the
|
||||
index _may_ make a difference.
|
||||
|
||||
In short:
|
||||
|
||||
1. Try to write your query in such a way that it re-uses as many existing
|
||||
indexes as possible.
|
||||
2. Run the query using `EXPLAIN ANALYZE` and study the output to find the most
|
||||
ideal query.
|
||||
|
||||
## Data Size
|
||||
|
||||
A database may decide not to use an index despite it existing in case a regular
|
||||
sequence scan (= simply iterating over all existing rows) is faster. This is
|
||||
especially the case for small tables.
|
||||
|
||||
If a table is expected to grow in size and you expect your query has to filter
|
||||
out a lot of rows you may want to consider adding an index. If the table size is
|
||||
very small (e.g. only a handful of rows) or any existing indexes filter out
|
||||
enough rows you may _not_ want to add a new index.
|
||||
|
||||
## Maintenance Overhead
|
||||
|
||||
Indexes have to be updated on every table write. In case of PostgreSQL _all_
|
||||
existing indexes will be updated whenever data is written to a table. As a
|
||||
result of this having many indexes on the same table will slow down writes.
|
||||
|
||||
Because of this one should ask themselves: is the reduction in query performance
|
||||
worth the overhead of maintaining an extra index?
|
||||
|
||||
If adding an index reduces SELECT timings by 5 milliseconds but increases
|
||||
INSERT/UPDATE/DELETE timings by 10 milliseconds then the index may not be worth
|
||||
it. On the other hand, if SELECT timings are reduced but INSERT/UPDATE/DELETE
|
||||
timings are not affected you may want to add the index after all.
|
||||
|
||||
## Finding Unused Indexes
|
||||
|
||||
To see which indexes are unused you can run the following query:
|
||||
|
||||
```sql
|
||||
SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass))
|
||||
FROM pg_stat_all_indexes
|
||||
WHERE schemaname = 'public'
|
||||
AND idx_scan = 0
|
||||
AND idx_tup_read = 0
|
||||
AND idx_tup_fetch = 0
|
||||
ORDER BY pg_relation_size(indexrelname::regclass) desc;
|
||||
```
|
||||
|
||||
This query outputs a list containing all indexes that are never used and sorts
|
||||
them by indexes sizes in descending order. This query can be useful to
|
||||
determine if any previously indexes are useful after all. More information on
|
||||
the meaning of the various columns can be found at
|
||||
<https://www.postgresql.org/docs/current/static/monitoring-stats.html>.
|
||||
|
||||
Because the output of this query relies on the actual usage of your database it
|
||||
may be affected by factors such as (but not limited to):
|
||||
|
||||
* Certain queries never being executed, thus not being able to use certain
|
||||
indexes.
|
||||
* Certain tables having little data, resulting in PostgreSQL using sequence
|
||||
scans instead of index scans.
|
||||
|
||||
In other words, this data is only reliable for a frequently used database with
|
||||
plenty of data and with as many GitLab features enabled (and being used) as
|
||||
possible.
|
Loading…
Reference in a new issue