Merge branch 'docs-label-filtering' into 'master'

Add development documentation on label filtering Closes #49651 See merge request gitlab-org/gitlab-ce!30741
2019-07-26 09:53:53 +00:00 · 2019-07-26 09:53:53 +00:00 · 0d2b5bff8c
parent 59a5a89c61 1ec7dae9ad
commit 0d2b5bff8c
2 changed files with 170 additions and 0 deletions
--- a/doc/development/README.md
+++ b/doc/development/README.md
@ -113,6 +113,10 @@ description: 'Learn how to contribute to GitLab.'
 - [Database helper modules](database_helpers.md)
 - [Code comments](code_comments.md)
 ## Case studies
 - [Database case study: Filtering by label](filtering_by_label.md)
 ## Integration guides
 - [Jira Connect app](integrations/jira_connect.md)
--- a/doc/development/filtering_by_label.md
+++ b/doc/development/filtering_by_label.md
@ -0,0 +1,166 @@
 # Filtering by label
 ## Introduction
 GitLab has [labels](../user/project/labels.md) that can be assigned to issues,
 merge requests, and epics. Labels on those objects are a many-to-many relation
 through the polymorphic `label_links` table.
 To filter these objects by multiple labels - for instance, 'all open
 issues with the label ~Plan and the label ~backend' - we generate a
 query containing a `GROUP BY` clause. In a simple form, this looks like:
 ```sql
 SELECT
    issues.*
 FROM
    issues
    INNER JOIN label_links ON label_links.target_id = issues.id
        AND label_links.target_type = 'Issue'
    INNER JOIN labels ON labels.id = label_links.label_id
 WHERE
    issues.project_id = 13083
    AND (issues.state IN ('opened'))
    AND labels.title IN ('Plan',
        'backend')
 GROUP BY
    issues.id
 HAVING (COUNT(DISTINCT labels.title) = 2)
 ORDER BY
    issues.updated_at DESC,
    issues.id DESC
 LIMIT 20 OFFSET 0
 ```
 In particular, note that:
 1. We `GROUP BY issues.id` so that we can ...
 2. Use the `HAVING (COUNT(DISTINCT labels.title) = 2)` condition to ensure that
   all matched issues have both labels.
 This is more complicated than is ideal. It makes the query construction more
 prone to errors (such as
 [gitlab-org/gitlab-ce#15557](https://gitlab.com/gitlab-org/gitlab-ce/issues/15557)).
 ## Attempt A: WHERE EXISTS
 ### Attempt A1: use multiple subqueries with WHERE EXISTS
 In
 [gitlab-org/gitlab-ce#37137](https://gitlab.com/gitlab-org/gitlab-ce/issues/37137)
 and its associated merge request
 [gitlab-org/gitlab-ce!14022](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14022),
 we tried to replace the `GROUP BY` with multiple uses of `WHERE EXISTS`. For the
 example above, this would give:
 ```sql
 WHERE (EXISTS (
        SELECT
            TRUE
        FROM
            label_links
            INNER JOIN labels ON labels.id = label_links.label_id
        WHERE
            labels.title = 'Plan'
            AND target_type = 'Issue'
            AND target_id = issues.id))
 AND (EXISTS (
        SELECT
            TRUE
        FROM
            label_links
            INNER JOIN labels ON labels.id = label_links.label_id
        WHERE
            labels.title = 'backend'
            AND target_type = 'Issue'
            AND target_id = issues.id))
 ```
 While this worked without schema changes, and did improve readability somewhat,
 it did not improve query performance.
 ## Attempt B: Denormalize using an array column
 Having [removed MySQL support in GitLab
 12.1](https://about.gitlab.com/2019/06/27/removing-mysql-support/), using
 [Postgres's arrays](https://www.postgresql.org/docs/9.6/arrays.html) became more
 tractable as we didn't have to support two databases. We discussed denormalizing
 the `label_links` table for querying in
 [gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651),
 with two options: label IDs and titles.
 We can think of both of those as array columns on `issues`, `merge_requests`,
 and `epics`: `issues.label_ids` would be an array column of label IDs, and
 `issues.label_titles` would be an array of label titles.
 These array columns can be complemented with [GIN
 indexes](https://www.postgresql.org/docs/9.6/gin-intro.html) to improve
 matching.
 ### Attempt B1: store label IDs for each object
 This has some strong advantages over titles:
 1. Unless a label is deleted, or a project is moved, we never need to
   bulk-update the denormalized column.
 2. It uses less storage than the titles.
 Unfortunately, our application design makes this hard. If we were able to query
 just by label ID easily, we wouldn't need the `INNER JOIN labels` in the initial
 query at the start of this document. GitLab allows users to filter by label
 title across projects and even across groups, so a filter by the label ~Plan may
 include labels with multiple distinct IDs.
 We do not want users to have to know about the different IDs, which means that
 given this data set:
 | Project | ~Plan label ID | ~backend label ID |
 | ---     | ---            | ---               |
 | A       | 11             | 12                |
 | B       | 21             | 22                |
 | C       | 31             | 32                |
 We would need something like:
 ```sql
 WHERE
    label_ids @> ARRAY[11, 12]
    OR label_ids @> ARRAY[21, 22]
    OR label_ids @> ARRAY[31, 32]
 ```
 This can get even more complicated when we consider that in some cases, there
 might be two ~backend labels - with different IDs - that could apply to the same
 object, so the number of combinations would balloon further.
 ### Attempt B2: store label titles for each object
 From the perspective of updating the labelable object, this is the worst
 option. We have to bulk update the objects when:
 1. The objects are moved from one project to another.
 1. The project is moved from one group to another.
 1. The label is renamed.
 1. The label is deleted.
 It also uses much more storage. Querying is simple, though:
 ```sql
 WHERE
    label_titles @> ARRAY['Plan', 'backend']
 ```
 And our [tests in
 gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651#note_188777346)
 showed that this could be fast.
 However, at present, the disadvantages outweigh the advantages.
 ## Conclusion
 We have yet to find a method that is demonstratably better than the current
 method, when considering:
 1. Query performance.
 1. Readability.
 1. Ease of maintaining schema consistency.