diff --git a/doc/development/README.md b/doc/development/README.md index bf67b5d8dff..57f37da6f80 100644 --- a/doc/development/README.md +++ b/doc/development/README.md @@ -30,7 +30,11 @@ - [Rake tasks](rake_tasks.md) for development - [Shell commands](shell_commands.md) in the GitLab codebase - [Sidekiq debugging](sidekiq_debugging.md) + +## Databases + - [What requires downtime?](what_requires_downtime.md) +- [Adding database indexes](adding_database_indexes.md) ## Compliance diff --git a/doc/development/adding_database_indexes.md b/doc/development/adding_database_indexes.md new file mode 100644 index 00000000000..ea6f14da3b9 --- /dev/null +++ b/doc/development/adding_database_indexes.md @@ -0,0 +1,123 @@ +# Adding Database Indexes + +Indexes can be used to speed up database queries, but when should you add a new +index? Traditionally the answer to this question has been to add an index for +every column used for filtering or joining data. For example, consider the +following query: + +```sql +SELECT * +FROM projects +WHERE user_id = 2; +``` + +Here we are filtering by the `user_id` column and as such a developer may decide +to index this column. + +While in certain cases indexing columns using the above approach may make sense +it can actually have a negative impact. Whenever you write data to a table any +existing indexes need to be updated. The more indexes there are the slower this +can potentially become. Indexes can also take up quite some disk space depending +on the amount of data indexed and the index type. For example, PostgreSQL offers +"GIN" indexes which can be used to index certain data types that can not be +indexed by regular btree indexes. These indexes however generally take up more +data and are slower to update compared to btree indexes. + +Because of all this one should not blindly add a new index for every column used +to filter data by. Instead one should ask themselves the following questions: + +1. Can I write my query in such a way that it re-uses as many existing indexes + as possible? +2. Is the data going to be large enough that using an index will actually be + faster than just iterating over the rows in the table? +3. Is the overhead of maintaining the index worth the reduction in query + timings? + +We'll explore every question in detail below. + +## Re-using Queries + +The first step is to make sure your query re-uses as many existing indexes as +possible. For example, consider the following query: + +```sql +SELECT * +FROM todos +WHERE user_id = 123 +AND state = 'open'; +``` + +Now imagine we already have an index on the `user_id` column but not on the +`state` column. One may think this query will perform badly due to `state` being +unindexed. In reality the query may perform just fine given the index on +`user_id` can filter out enough rows. + +The best way to determine if indexes are re-used is to run your query using +`EXPLAIN ANALYZE`. Depending on any extra tables that may be joined and +other columns being used for filtering you may find an extra index is not going +to make much (if any) difference. On the other hand you may determine that the +index _may_ make a difference. + +In short: + +1. Try to write your query in such a way that it re-uses as many existing + indexes as possible. +2. Run the query using `EXPLAIN ANALYZE` and study the output to find the most + ideal query. + +## Data Size + +A database may decide not to use an index despite it existing in case a regular +sequence scan (= simply iterating over all existing rows) is faster. This is +especially the case for small tables. + +If a table is expected to grow in size and you expect your query has to filter +out a lot of rows you may want to consider adding an index. If the table size is +very small (e.g. only a handful of rows) or any existing indexes filter out +enough rows you may _not_ want to add a new index. + +## Maintenance Overhead + +Indexes have to be updated on every table write. In case of PostgreSQL _all_ +existing indexes will be updated whenever data is written to a table. As a +result of this having many indexes on the same table will slow down writes. + +Because of this one should ask themselves: is the reduction in query performance +worth the overhead of maintaining an extra index? + +If adding an index reduces SELECT timings by 5 milliseconds but increases +INSERT/UPDATE/DELETE timings by 10 milliseconds then the index may not be worth +it. On the other hand, if SELECT timings are reduced but INSERT/UPDATE/DELETE +timings are not affected you may want to add the index after all. + +## Finding Unused Indexes + +To see which indexes are unused you can run the following query: + +```sql +SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass)) +FROM pg_stat_all_indexes +WHERE schemaname = 'public' +AND idx_scan = 0 +AND idx_tup_read = 0 +AND idx_tup_fetch = 0 +ORDER BY pg_relation_size(indexrelname::regclass) desc; +``` + +This query outputs a list containing all indexes that are never used and sorts +them by indexes sizes in descending order. This query can be useful to +determine if any previously indexes are useful after all. More information on +the meaning of the various columns can be found at +. + +Because the output of this query relies on the actual usage of your database it +may be affected by factors such as (but not limited to): + +* Certain queries never being executed, thus not being able to use certain + indexes. +* Certain tables having little data, resulting in PostgreSQL using sequence + scans instead of index scans. + +In other words, this data is only reliable for a frequently used database with +plenty of data and with as many GitLab features enabled (and being used) as +possible.