diff --git a/doc/development/README.md b/doc/development/README.md index a74770ae383..45efa744a6b 100644 --- a/doc/development/README.md +++ b/doc/development/README.md @@ -111,6 +111,10 @@ description: 'Learn how to contribute to GitLab.' - [Database helper modules](database_helpers.md) - [Code comments](code_comments.md) +## Case studies + +- [Database case study: Filtering by label](filtering_by_label.md) + ## Integration guides - [Jira Connect app](integrations/jira_connect.md) diff --git a/doc/development/filtering_by_label.md b/doc/development/filtering_by_label.md new file mode 100644 index 00000000000..6e6b71b1787 --- /dev/null +++ b/doc/development/filtering_by_label.md @@ -0,0 +1,166 @@ +# Filtering by label + +## Introduction + +GitLab has [labels](../user/project/labels.md) that can be assigned to issues, +merge requests, and epics. Labels on those objects are a many-to-many relation +through the polymorphic `label_links` table. + +To filter these objects by multiple labels - for instance, 'all open +issues with the label ~Plan and the label ~backend' - we generate a +query containing a `GROUP BY` clause. In a simple form, this looks like: + +```sql +SELECT + issues.* +FROM + issues + INNER JOIN label_links ON label_links.target_id = issues.id + AND label_links.target_type = 'Issue' + INNER JOIN labels ON labels.id = label_links.label_id +WHERE + issues.project_id = 13083 + AND (issues.state IN ('opened')) + AND labels.title IN ('Plan', + 'backend') +GROUP BY + issues.id +HAVING (COUNT(DISTINCT labels.title) = 2) +ORDER BY + issues.updated_at DESC, + issues.id DESC +LIMIT 20 OFFSET 0 +``` + +In particular, note that: + +1. We `GROUP BY issues.id` so that we can ... +2. Use the `HAVING (COUNT(DISTINCT labels.title) = 2)` condition to ensure that + all matched issues have both labels. + +This is more complicated than is ideal. It makes the query construction more +prone to errors (such as +[gitlab-org/gitlab-ce#15557](https://gitlab.com/gitlab-org/gitlab-ce/issues/15557)). + +## Attempt A: WHERE EXISTS + +### Attempt A1: use multiple subqueries with WHERE EXISTS + +In +[gitlab-org/gitlab-ce#37137](https://gitlab.com/gitlab-org/gitlab-ce/issues/37137) +and its associated merge request +[gitlab-org/gitlab-ce!14022](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14022), +we tried to replace the `GROUP BY` with multiple uses of `WHERE EXISTS`. For the +example above, this would give: + +```sql +WHERE (EXISTS ( + SELECT + TRUE + FROM + label_links + INNER JOIN labels ON labels.id = label_links.label_id + WHERE + labels.title = 'Plan' + AND target_type = 'Issue' + AND target_id = issues.id)) +AND (EXISTS ( + SELECT + TRUE + FROM + label_links + INNER JOIN labels ON labels.id = label_links.label_id + WHERE + labels.title = 'backend' + AND target_type = 'Issue' + AND target_id = issues.id)) +``` + +While this worked without schema changes, and did improve readability somewhat, +it did not improve query performance. + +## Attempt B: Denormalize using an array column + +Having [removed MySQL support in GitLab +12.1](https://about.gitlab.com/2019/06/27/removing-mysql-support/), using +[Postgres's arrays](https://www.postgresql.org/docs/9.6/arrays.html) became more +tractable as we didn't have to support two databases. We discussed denormalizing +the `label_links` table for querying in +[gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651), +with two options: label IDs and titles. + +We can think of both of those as array columns on `issues`, `merge_requests`, +and `epics`: `issues.label_ids` would be an array column of label IDs, and +`issues.label_titles` would be an array of label titles. + +These array columns can be complemented with [GIN +indexes](https://www.postgresql.org/docs/9.6/gin-intro.html) to improve +matching. + +### Attempt B1: store label IDs for each object + +This has some strong advantages over titles: + +1. Unless a label is deleted, or a project is moved, we never need to + bulk-update the denormalized column. +2. It uses less storage than the titles. + +Unfortunately, our application design makes this hard. If we were able to query +just by label ID easily, we wouldn't need the `INNER JOIN labels` in the initial +query at the start of this document. GitLab allows users to filter by label +title across projects and even across groups, so a filter by the label ~Plan may +include labels with multiple distinct IDs. + +We do not want users to have to know about the different IDs, which means that +given this data set: + +| Project | ~Plan label ID | ~backend label ID | +| --- | --- | --- | +| A | 11 | 12 | +| B | 21 | 22 | +| C | 31 | 32 | + +We would need something like: + +```sql +WHERE + label_ids @> ARRAY[11, 12] + OR label_ids @> ARRAY[21, 22] + OR label_ids @> ARRAY[31, 32] +``` + +This can get even more complicated when we consider that in some cases, there +might be two ~backend labels - with different IDs - that could apply to the same +object, so the number of combinations would balloon further. + +### Attempt B2: store label titles for each object + +From the perspective of updating the labelable object, this is the worst +option. We have to bulk update the objects when: + +1. The objects are moved from one project to another. +1. The project is moved from one group to another. +1. The label is renamed. +1. The label is deleted. + +It also uses much more storage. Querying is simple, though: + +```sql +WHERE + label_titles @> ARRAY['Plan', 'backend'] +``` + +And our [tests in +gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651#note_188777346) +showed that this could be fast. + +However, at present, the disadvantages outweigh the advantages. + +## Conclusion + +We have yet to find a method that is demonstratably better than the current +method, when considering: + +1. Query performance. +1. Readability. +1. Ease of maintaining schema consistency.