---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Pseudonymizer **(ULTIMATE)**

> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/5532) in [GitLab Ultimate](https://about.gitlab.com/pricing/) 11.1.

GitLab's database hosts sensitive information, so using it unfiltered for analytics
imposes strict security requirements. To help meet them, the Pseudonymizer
service exports GitLab's data in a pseudonymized form.

CAUTION: **Warning:**
Pseudonymization is not foolproof. A user with access to the source data can
correlate it with the pseudonymized version.

The Pseudonymizer currently uses `HMAC(SHA256)` to mutate fields that shouldn't
be exported as plain text. This ensures that:

- End users of the data source can't infer or revert the pseudonymized fields.
- Referential integrity is maintained.
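
The following Ruby sketch illustrates why a keyed hash keeps references intact. It is not the Pseudonymizer's actual implementation: the `pseudonymize` helper, the `EXAMPLE_SECRET` key, and the sample rows are all hypothetical. It only relies on `OpenSSL::HMAC` from the Ruby standard library.

```ruby
require 'openssl'

# Hypothetical secret for illustration only; this page does not describe
# the key the Pseudonymizer actually uses.
EXAMPLE_SECRET = 'example-secret'

# Return a keyed SHA-256 digest of the value as a 64-character hex string.
def pseudonymize(value, secret = EXAMPLE_SECRET)
  OpenSSL::HMAC.hexdigest('SHA256', secret, value.to_s)
end

# Two hypothetical tables that reference the same user.
users  = [{ id: 42, email: 'alice@example.com' }]
issues = [{ id: 7, author_id: 42 }]

pseudo_users = users.map do |row|
  row.merge(id: pseudonymize(row[:id]), email: pseudonymize(row[:email]))
end
pseudo_issues = issues.map do |row|
  row.merge(author_id: pseudonymize(row[:author_id]))
end

# The same input always yields the same digest, so the join between
# users and issues still matches after pseudonymization.
puts pseudo_users.first[:id] == pseudo_issues.first[:author_id]
```

Because the digest depends on the secret, someone who holds only the exported data cannot recompute it from a guessed value without also knowing the key.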

## Configuration

To configure the pseudonymizer, you need to:

- Provide a manifest file that describes which fields should be included or
  pseudonymized ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab/tree/master/config/pseudonymizer.yml)).
  GitLab ships a default manifest, referenced by a file path relative to the Rails root.
  Alternatively, you can use an absolute file path.
- Use object storage and specify the connection parameters in the `pseudonymizer.upload.connection` configuration option.

[Read more about using object storage with GitLab](object_storage.md).

**For Omnibus installations:**

1. Edit `/etc/gitlab/gitlab.rb` and add the following lines, replacing the
   values with your own:

   ```ruby
   gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
   gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
   gitlab_rails['pseudonymizer_upload_connection'] = {
     'provider' => 'AWS',
     'region' => 'eu-central-1',
     'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
     'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
   }
   ```

   If you use AWS IAM profiles, omit the `aws_access_key_id` and `aws_secret_access_key` entries and set `use_iam_profile` instead:

   ```ruby
   gitlab_rails['pseudonymizer_upload_connection'] = {
     'provider' => 'AWS',
     'region' => 'eu-central-1',
     'use_iam_profile' => true
   }
   ```

1. Save the file and [reconfigure GitLab](restart_gitlab.md#omnibus-gitlab-reconfigure)
   for the changes to take effect.

---

**For installations from source:**

1. Edit `/home/git/gitlab/config/gitlab.yml` and add or amend the following
   lines:

   ```yaml
   pseudonymizer:
     manifest: config/pseudonymizer.yml
     upload:
       remote_directory: 'gitlab-elt' # bucket name
       connection:
         provider: AWS
         aws_access_key_id: AWS_ACCESS_KEY_ID
         aws_secret_access_key: AWS_SECRET_ACCESS_KEY
         region: eu-central-1
   ```

1. Save the file and [restart GitLab](restart_gitlab.md#installations-from-source)
   for the changes to take effect.

## Usage

Run the pseudonymizer with the Rake task for your installation type. You can optionally set the following environment variables:

- `PSEUDONYMIZER_OUTPUT_DIR` - where to store the output CSV files (defaults to `/tmp`)
- `PSEUDONYMIZER_BATCH` - the batch size when querying the DB (defaults to `100000`)

```shell
## Omnibus
sudo gitlab-rake gitlab:db:pseudonymizer

## Source
sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
```
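
As a rough illustration of how the two optional variables and their documented defaults behave, the Ruby sketch below reads them with fallbacks. The `pseudonymizer_options` helper is hypothetical and is not GitLab's code; only the variable names and default values come from this page.

```ruby
# Hypothetical helper, not GitLab's implementation: resolve the optional
# settings from an environment-like hash, falling back to the defaults
# documented above.
def pseudonymizer_options(env = ENV)
  {
    output_dir: env.fetch('PSEUDONYMIZER_OUTPUT_DIR', '/tmp'),
    batch_size: Integer(env.fetch('PSEUDONYMIZER_BATCH', '100000'))
  }
end

# With nothing set, the defaults apply.
puts pseudonymizer_options({}).inspect

# Overriding only the batch size keeps the default output directory.
puts pseudonymizer_options({ 'PSEUDONYMIZER_BATCH' => '5000' }).inspect
```

Using `Integer(...)` rather than `to_i` makes a malformed batch size raise an error instead of silently becoming `0`.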

The task produces CSV files that can be very large, so make sure
`PSEUDONYMIZER_OUTPUT_DIR` has sufficient free space. As a rule of thumb, allow
at least 10% of the database size.
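
The rule of thumb can be turned into a quick calculation. The sketch below is illustrative only: `recommended_free_bytes` is a hypothetical helper and the 50 GiB database size is an assumed example, for which the rule suggests at least 5 GiB of free space.

```ruby
GIB = 1024**3

# Minimum free space suggested by the 10% rule of thumb, rounded up to
# the next whole byte.
def recommended_free_bytes(database_bytes)
  (database_bytes + 9) / 10
end

database_bytes = 50 * GIB # hypothetical database size
puts format('%.1f GiB', recommended_free_bytes(database_bytes).to_f / GIB)
```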

After the pseudonymizer has run, the output CSV files should be uploaded to the
configured object storage and deleted from the local disk.