12 KiB
stage | group | info |
---|---|---|
Data Stores | Database | To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments |
Adding Database Indexes
Indexes can be used to speed up database queries, but when should you add a new index? Traditionally the answer to this question has been to add an index for every column used for filtering or joining data. For example, consider the following query:
SELECT *
FROM projects
WHERE user_id = 2;
Here we are filtering by the user_id
column and as such a developer may decide
to index this column.
While in certain cases indexing columns using the above approach may make sense,
it can actually have a negative impact. Whenever you write data to a table, any
existing indexes must also be updated. The more indexes there are, the slower this
can potentially become. Indexes can also take up significant disk space, depending
on the amount of data indexed and the index type. For example, PostgreSQL offers
GIN
indexes which can be used to index certain data types that cannot be
indexed by regular B-tree indexes. These indexes, however, generally take up more
data and are slower to update compared to B-tree indexes.
Because of all this, it's important make the following considerations when adding a new index:
- Do the new queries re-use as many existing indexes as possible?
- Is there enough data that using an index is faster than iterating over rows in the table?
- Is the overhead of maintaining the index worth the reduction in query timings?
Re-using Queries
The first step is to make sure your query re-uses as many existing indexes as possible. For example, consider the following query:
SELECT *
FROM todos
WHERE user_id = 123
AND state = 'open';
Now imagine we already have an index on the user_id
column but not on the
state
column. One may think this query performs badly due to state
being
unindexed. In reality the query may perform just fine given the index on
user_id
can filter out enough rows.
The best way to determine if indexes are re-used is to run your query using
EXPLAIN ANALYZE
. Depending on the joined tables and the columns being used for filtering,
you may find an extra index doesn't make much, if any, difference.
In short:
- Try to write your query in such a way that it re-uses as many existing indexes as possible.
- Run the query using
EXPLAIN ANALYZE
and study the output to find the most ideal query.
Data Size
A database may not use an index even when a regular sequence scan (iterating over all rows) is faster, especially for small tables.
Consider adding an index if a table is expected to grow, and your query has to filter a lot of rows.
You may not want to add an index if the table size is small (<1,000
records),
or if existing indexes already filter out enough rows.
Maintenance Overhead
Indexes have to be updated on every table write. In the case of PostgreSQL, all existing indexes are updated whenever data is written to a table. As a result, having many indexes on the same table slows down writes. It's therefore important to balance query performance with the overhead of maintaining an extra index.
Let's say that adding an index reduces SELECT timings by 5 milliseconds but increases INSERT/UPDATE/DELETE timings by 10 milliseconds. In this case, the new index may not be worth it. A new index is more valuable when SELECT timings are reduced and INSERT/UPDATE/DELETE timings are unaffected.
Finding Unused Indexes
To see which indexes are unused you can run the following query:
SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass))
FROM pg_stat_all_indexes
WHERE schemaname = 'public'
AND idx_scan = 0
AND idx_tup_read = 0
AND idx_tup_fetch = 0
ORDER BY pg_relation_size(indexrelname::regclass) desc;
This query outputs a list containing all indexes that are never used and sorts them by indexes sizes in descending order. This query helps in determining whether existing indexes are still required. More information on the meaning of the various columns can be found at https://www.postgresql.org/docs/current/monitoring-stats.html.
Because the query output relies on the actual usage of your database, it may be affected by factors such as:
- Certain queries never being executed, thus not being able to use certain indexes.
- Certain tables having little data, resulting in PostgreSQL using sequence scans instead of index scans.
This data is only reliable for a frequently used database with plenty of data, and using as many GitLab features as possible.
Requirements for naming indexes
Indexes with complex definitions must be explicitly named rather than relying on the implicit naming behavior of migration methods. In short, that means you must provide an explicit name argument for an index created with one or more of the following options:
where
using
order
length
type
opclass
Considerations for index names
Index names don't have any significance in the database, so they should attempt to communicate intent to others. The most important rule to remember is that generic names are more likely to conflict or be duplicated, and should not be used. Some other points to consider:
- For general indexes, use a template, like:
index_{table}_{column}_{options}
. - For indexes added to solve a very specific problem, it may make sense for the name to reflect their use.
- Identifiers in PostgreSQL have a maximum length of 63 bytes.
- Check
db/structure.sql
for conflicts and ideas.
Why explicit names are required
As Rails is database agnostic, it generates an index name only from the required options of all indexes: table name and column names. For example, imagine the following two indexes are created in a migration:
def up
add_index :my_table, :my_column
add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
end
Creation of the second index would fail, because Rails would generate the same name for both indexes.
This naming issue is further complicated by the behavior of the index_exists?
method.
It considers only the table name, column names, and uniqueness specification
of the index when making a comparison. Consider:
def up
unless index_exists?(:my_table, :my_column, where: 'my_column IS NOT NULL')
add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
end
end
The call to index_exists?
returns true if any index exists on
:my_table
and :my_column
, and index creation is bypassed.
The add_concurrent_index
helper is a requirement for creating indexes
on populated tables. Because it cannot be used inside a transactional
migration, it has a built-in check that detects if the index already
exists. In the event a match is found, index creation is skipped.
Without an explicit name argument, Rails can return a false positive
for index_exists?
, causing a required index to not be created
properly. By always requiring a name for certain types of indexes, the
chance of error is greatly reduced.
Temporary indexes
There may be times when an index is only needed temporarily.
For example, in a migration, a column of a table might be conditionally updated. To query which columns must be updated in the query performance guidelines, an index is needed that would otherwise not be used.
In these cases, consider a temporary index. To specify a temporary index:
- Prefix the index name with
tmp_
and follow the naming conventions and requirements for naming indexes for the rest of the name. - Create a follow-up issue to remove the index in the next (or future) milestone.
- Add a comment in the migration mentioning the removal issue.
A temporary migration would look like:
INDEX_NAME = 'tmp_index_projects_on_owner_where_emails_disabled'
def up
# Temporary index to be removed in 13.9 https://gitlab.com/gitlab-org/gitlab/-/issues/1234
add_concurrent_index :projects, :creator_id, where: 'emails_disabled = false', name: INDEX_NAME
end
def down
remove_concurrent_index_by_name :projects, INDEX_NAME
end
Create indexes asynchronously
For very large tables, index creation can be a challenge to manage.
While add_concurrent_index
creates indexes in a way that does not block
normal traffic, it can still be problematic when index creation runs for
many hours. Necessary database operations like autovacuum
cannot run, and
on GitLab.com, the deployment process is blocked waiting for index
creation to finish.
To limit impact on GitLab.com, a process exists to create indexes asynchronously during weekend hours. Due to generally lower traffic and fewer deployments, index creation can proceed at a lower level of risk.
Schedule index creation for a low-impact time
- Schedule the index to be created.
- Verify the MR was deployed and the index exists in production.
- Add a migration to create the index synchronously.
Schedule the index to be created
Create an MR with a post-deployment migration which prepares the index
for asynchronous creation. An example of creating an index using
the asynchronous index helpers can be seen in the block below. This migration
enters the index name and definition into the postgres_async_indexes
table. The process that runs on weekends pulls indexes from this
table and attempt to create them.
# in db/post_migrate/
INDEX_NAME = 'index_ci_builds_on_some_column'
def up
prepare_async_index :ci_builds, :some_column, name: INDEX_NAME
end
def down
unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME
end
Verify the MR was deployed and the index exists in production
You can verify if the MR was deployed to GitLab.com by executing
/chatops run auto_deploy status <merge_sha>
. To verify existence of
the index, you can:
- Use a meta-command in #database-lab, such as:
\d <index_name>
.- Ensure that the index is not
invalid
.
- Ensure that the index is not
- Ask someone in #database to check if the index exists.
- With proper access, you can also verify directly on production or in a production clone.
Add a migration to create the index synchronously
After the index is verified to exist on the production database, create a second
merge request that adds the index synchronously. The schema changes must be
updated and committed to structure.sql
in this second merge request.
The synchronous migration results in a no-op on GitLab.com, but you should still add the
migration as expected for other installations. The below block
demonstrates how to create the second migration for the previous
asynchronous example.
WARNING:
Verify that the index exists in production before merging a second migration with add_concurrent_index
.
If the second migration is deployed before the index has been created,
the index is created synchronously when the second migration executes.
# in db/post_migrate/
INDEX_NAME = 'index_ci_builds_on_some_column'
disable_ddl_transaction!
def up
add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME
end
def down
remove_concurrent_index_by_name :ci_builds, INDEX_NAME
end
Test database index changes locally
You must test the database index changes locally before creating a merge request.
Verify indexes created asynchronously
Use the asynchronous index helpers on your local environment to test changes for creating an index:
- Enable the feature flags by running
Feature.enable(:database_async_index_creation)
andFeature.enable(:database_reindexing)
in the Rails console. - Run
bundle exec rails db:migrate
so that it creates an entry in thepostgres_async_indexes
table. - Run
bundle exec rails gitlab:db:reindex
so that the index is created asynchronously. - To verify the index, open the PostgreSQL console using the GDK command
gdk psql
and run the command\d <index_name>
to check that your newly created index exists.