---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Database case study: Namespaces storage statistics
## Introduction
In [Storage and limits management for groups](https://gitlab.com/groups/gitlab-org/-/epics/886),
we want to provide a way to easily view the amount of
storage a group consumes, and to manage that storage easily.
## Proposal
1. Create a new ActiveRecord model to hold the namespaces' statistics in an aggregated form (only for root [namespaces](../user/group/index.md#namespaces)).
1. Refresh the statistics in this model every time a project belonging to this namespace is changed.
## Problem
In GitLab, we update the project storage statistics through a
[callback](https://gitlab.com/gitlab-org/gitlab/-/blob/4ab54c2233e91f60a80e5b6fa2181e6899fdcc3e/app/models/project.rb#L97)
every time the project is saved.

The summary of those statistics per namespace is then retrieved
by the [`Namespaces#with_statistics`](https://gitlab.com/gitlab-org/gitlab/-/blob/4ab54c2233e91f60a80e5b6fa2181e6899fdcc3e/app/models/namespace.rb#L70) scope. Analyzing this query, we noticed that:

- It takes up to `1.2` seconds for namespaces with over `15k` projects.
- It can't be analyzed with [ChatOps](chatops_on_gitlabcom.md), as it times out.

Additionally, the pattern that is currently used to update the project statistics
(the callback) doesn't scale adequately. It is currently one of the
[database query transactions on production](https://gitlab.com/gitlab-org/gitlab/-/issues/29070)
that takes the most time overall. We can't add one more query to it, as
doing so would increase the transaction's length.

Because of all of the above, we can't apply the same pattern to store
and update the namespaces statistics, as the `namespaces` table is one
of the largest tables on GitLab.com. Therefore, we needed to find an alternative,
performant method.
## Attempts
### Attempt A: PostgreSQL materialized view
The model can be updated through a refresh strategy based on a project routes SQL query and a [materialized view](https://www.postgresql.org/docs/11/rules-materializedviews.html):
```sql
SELECT split_part("rs".path, '/', 1) as root_path,
       COALESCE(SUM(ps.storage_size), 0) AS storage_size,
       COALESCE(SUM(ps.repository_size), 0) AS repository_size,
       COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
       COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
       COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
       COALESCE(SUM(ps.pipeline_artifacts_size), 0) AS pipeline_artifacts_size,
       COALESCE(SUM(ps.packages_size), 0) AS packages_size,
       COALESCE(SUM(ps.snippets_size), 0) AS snippets_size,
       COALESCE(SUM(ps.uploads_size), 0) AS uploads_size
FROM "projects"
    INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
    INNER JOIN project_statistics ps ON ps.project_id = projects.id
GROUP BY root_path
```
We could then refresh the view with:
```sql
REFRESH MATERIALIZED VIEW root_namespace_storage_statistics;
```
While this implied a single query update (and probably a fast one), it has some downsides:

- Materialized view syntax differs between PostgreSQL and MySQL. While this feature was being worked on, MySQL was still supported by GitLab.
- Rails does not have native support for materialized views. We'd need to use a specialized gem to manage the database views, which implies additional work.
### Attempt B: An update through a CTE
Similar to Attempt A: the model is updated through a refresh strategy with a [Common Table Expression](https://www.postgresql.org/docs/9.1/queries-with.html):
```sql
WITH refresh AS (
  SELECT split_part("rs".path, '/', 1) as root_path,
         COALESCE(SUM(ps.storage_size), 0) AS storage_size,
         COALESCE(SUM(ps.repository_size), 0) AS repository_size,
         COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
         COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
         COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
         COALESCE(SUM(ps.pipeline_artifacts_size), 0) AS pipeline_artifacts_size,
         COALESCE(SUM(ps.packages_size), 0) AS packages_size,
         COALESCE(SUM(ps.snippets_size), 0) AS snippets_size,
         COALESCE(SUM(ps.uploads_size), 0) AS uploads_size
  FROM "projects"
      INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
      INNER JOIN project_statistics ps ON ps.project_id = projects.id
  GROUP BY root_path)
UPDATE namespace_storage_statistics
SET storage_size = refresh.storage_size,
    repository_size = refresh.repository_size,
    wiki_size = refresh.wiki_size,
    lfs_objects_size = refresh.lfs_objects_size,
    build_artifacts_size = refresh.build_artifacts_size,
    pipeline_artifacts_size = refresh.pipeline_artifacts_size,
    packages_size = refresh.packages_size,
    snippets_size = refresh.snippets_size,
    uploads_size = refresh.uploads_size
FROM refresh
    INNER JOIN routes rs ON rs.path = refresh.root_path AND rs.source_type = 'Namespace'
WHERE namespace_storage_statistics.namespace_id = rs.source_id
```
Same benefits and downsides as attempt A.
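
As an illustration of the aggregation that Attempts A and B both rely on, the following plain-Ruby sketch groups rows by the first segment of the route path and sums the sizes per root. It is not GitLab code: `ProjectRow`, `aggregate_by_root`, and the sample data are hypothetical stand-ins for the joined `projects`, `routes`, and `project_statistics` tables, and only two of the statistics columns are shown.

```ruby
# Hypothetical in-memory row standing in for one joined
# projects/routes/project_statistics record (not a GitLab model).
ProjectRow = Struct.new(:path, :storage_size, :repository_size, keyword_init: true)

# Groups rows by the first path segment (the root namespace path) and sums
# each statistic, mirroring split_part(path, '/', 1) with GROUP BY root_path.
def aggregate_by_root(rows)
  rows.group_by { |row| row.path.split('/').first }.transform_values do |group|
    {
      storage_size: group.sum(&:storage_size),
      repository_size: group.sum(&:repository_size)
    }
  end
end

# Made-up sample data, for illustration only.
sample_rows = [
  ProjectRow.new(path: 'group-a/project-1', storage_size: 300, repository_size: 200),
  ProjectRow.new(path: 'group-a/subgroup/project-2', storage_size: 50, repository_size: 20),
  ProjectRow.new(path: 'group-b/project-3', storage_size: 120, repository_size: 100)
]

p aggregate_by_root(sample_rows)
```

Projects nested in subgroups are counted toward the root because only the first path segment is used as the grouping key.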
### Attempt C: Get rid of the model and store the statistics on Redis
We could get rid of the model that stores the statistics in aggregated form and instead use a Redis Set.
This would be the [boring solution](https://about.gitlab.com/handbook/values/#boring-solutions) and the fastest one
to implement, as GitLab already includes Redis as part of its [Architecture](architecture.md#redis).

The downside of this approach is that Redis does not provide the same persistence/consistency guarantees as PostgreSQL,
and this is information we can't afford to lose in a Redis failure.
### Attempt D: Tag the root namespace and its child namespaces
Directly relate the root namespace to its child namespaces, so
whenever a namespace is created without a parent, it is tagged
with the root namespace ID:

| ID | root ID | parent ID |
|:---|:--------|:----------|
| 1  | 1       | NULL      |
| 2  | 1       | 1         |
| 3  | 1       | 2         |

To aggregate the statistics inside a namespace, we'd execute something like:
```sql
SELECT COUNT(...)
FROM projects
WHERE namespace_id IN (
  SELECT id
  FROM namespaces
  WHERE root_id = X
)
```
Even though this approach would make aggregating much easier, it has some major downsides:

- We'd have to migrate **all namespaces** by adding and filling a new column. Because of the size of the table, the time and cost would be significant. The background migration would take approximately `153h`, see <https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/29772>.
- The background migration has to be shipped one release before, delaying the functionality by another milestone.
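
The tagging idea in Attempt D can be sketched in plain Ruby as follows. This is a minimal illustration, not GitLab code: `NamespaceRecord` and `NamespaceTable` are hypothetical in-memory stand-ins for the `namespaces` table, and the rule that a namespace without a parent becomes its own root is an assumption read from the example table above.

```ruby
# Hypothetical in-memory stand-in for one row of the namespaces table.
NamespaceRecord = Struct.new(:id, :root_id, :parent_id)

# Hypothetical table wrapper; IDs are assigned sequentially starting at 1.
class NamespaceTable
  def initialize
    @rows = {}
    @next_id = 1
  end

  # Creates a namespace. Without a parent it is tagged as its own root
  # (assumption based on the example table); otherwise it inherits the
  # root ID already stored on its parent.
  def create(parent_id: nil)
    id = @next_id
    @next_id += 1
    root_id = parent_id.nil? ? id : @rows.fetch(parent_id).root_id
    @rows[id] = NamespaceRecord.new(id, root_id, parent_id)
  end

  # Equivalent of: SELECT id FROM namespaces WHERE root_id = X
  def ids_under_root(root_id)
    @rows.values.select { |ns| ns.root_id == root_id }.map(&:id)
  end
end
```

With the root ID stored on every row, finding all namespaces under a root is a single equality filter rather than a recursive walk of the parent chain, which is what makes the aggregation query simple.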
### Attempt E (final): Update the namespace storage statistics asynchronously
This approach consists of continuing to use the incremental statistics updates we already have,
but refreshing them through Sidekiq jobs and in different transactions:
1. Create a second table (`namespace_aggregation_schedules`) with two columns: `id` and `namespace_id`.
1. Whenever the statistics of a project change, insert a row into `namespace_aggregation_schedules`:
   - We don't insert a new row if there's already one related to the root namespace.
   - Keeping in mind the length of the transaction that involves updating `project_statistics` (<https://gitlab.com/gitlab-org/gitlab/-/issues/29070>), the insertion should be done in a different transaction and through a Sidekiq job.
1. After inserting the row, we schedule another worker to be executed asynchronously at two different moments:
   - One enqueued for immediate execution and another one scheduled to run in `1.5` hours.
   - We only schedule the jobs if we can obtain a `1.5h` lease on Redis on a key based on the root namespace ID.
   - If we can't obtain the lease, it indicates there's another aggregation already in progress, or scheduled in no more than `1.5h`.
1. This worker will:
   - Update the root namespace storage statistics by querying all the namespaces through a service.
   - Delete the related `namespace_aggregation_schedules` row after the update.
1. Another Sidekiq job is also included to traverse any remaining rows on the `namespace_aggregation_schedules` table and schedule jobs for every pending row.
   - This job is scheduled with cron to run every night (UTC).

This implementation has the following benefits:

- All the updates are done asynchronously, so we're not increasing the length of the transactions for `project_statistics`.
- We're doing the update in a single SQL query.
- It is compatible with PostgreSQL and MySQL.
- No background migration required.

The only downside of this approach is that namespaces' statistics are updated up to `1.5` hours after the change is done,
which means there's a time window in which the statistics are inaccurate. Because we're still not
[enforcing storage limits](https://gitlab.com/gitlab-org/gitlab/-/issues/17664), this is not a major problem.
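
The scheduling steps of Attempt E can be sketched in plain Ruby as follows. This is a simplified model under stated assumptions, not the actual implementation: `LeaseStore` is a hypothetical in-memory stand-in for the Redis lease, `AggregationScheduler` and its arrays stand in for the `namespace_aggregation_schedules` table and the Sidekiq queue, the lease key format is made up, and time is passed in explicitly as seconds so the behavior is easy to follow.

```ruby
LEASE_TIMEOUT = 1.5 * 60 * 60 # 1.5 hours, in seconds

# Hypothetical in-memory stand-in for a Redis-backed exclusive lease.
class LeaseStore
  def initialize
    @expires_at = {}
  end

  # Takes the lease and returns true if the key is free or expired at `now`.
  def try_obtain(key, timeout, now)
    return false if @expires_at.key?(key) && @expires_at[key] > now

    @expires_at[key] = now + timeout
    true
  end
end

# Hypothetical scheduler modelling steps 2 to 4 of the list above.
class AggregationScheduler
  attr_reader :pending, :jobs

  def initialize(leases)
    @leases = leases
    @pending = [] # stands in for namespace_aggregation_schedules rows
    @jobs = []    # stands in for the Sidekiq queue: [root_namespace_id, run_at]
  end

  # Called when the statistics of a project under this root namespace change.
  # Returns true only when jobs were enqueued.
  def schedule(root_namespace_id, now)
    # Step 2: never insert a second row for the same root namespace.
    return false if @pending.include?(root_namespace_id)

    @pending << root_namespace_id

    # Step 3: enqueue only if the lease can be obtained. Otherwise the row
    # stays pending until a later job (such as the nightly one) handles it.
    key = "aggregation:#{root_namespace_id}" # made-up key format
    return false unless @leases.try_obtain(key, LEASE_TIMEOUT, now)

    @jobs << [root_namespace_id, now]                 # immediate run
    @jobs << [root_namespace_id, now + LEASE_TIMEOUT] # run in 1.5 hours
    true
  end

  # Step 4: refresh the root statistics (supplied as a block in this sketch),
  # then delete the schedule row.
  def perform(root_namespace_id)
    yield root_namespace_id if block_given?
    @pending.delete(root_namespace_id)
  end
end
```

In this model, a burst of project changes under one root namespace results in at most one pending row and at most two jobs per lease window, which is what keeps the extra load bounded.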
## Conclusion
Updating the storage statistics asynchronously was the least problematic and
most performant approach to aggregating the root namespaces.

All the details regarding this use case can be found in:

- <https://gitlab.com/gitlab-org/gitlab-foss/-/issues/62214>
- Merge request with the implementation: <https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/28996>

The performance of the namespace storage statistics was measured in staging and production (GitLab.com). All results were posted
in <https://gitlab.com/gitlab-org/gitlab-foss/-/issues/64092>. No problems have been reported so far.