gitlab-org--gitlab-foss/doc/administration/pseudonymizer.md

4.6 KiB

stage group info
Enablement Distribution To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments

Pseudonymizer (ULTIMATE)

Your GitLab database contains sensitive information. To protect sensitive information when you run analytics on your database, you can use the Pseudonymizer service, which:

  1. Uses HMAC(SHA256) to mutate fields containing sensitive information.
  2. Preserves references (referential integrity) between fields.
  3. Exports your GitLab data, scrubbed of sensitive material.

WARNING: If the source data is available, users can compare and correlate the scrubbed data with the original.

To generate a pseudonymized data set:

  1. Configure Pseudonymizer fields and output location.
  2. Enable Pseudonymizer data collection.
  3. Optional. Generate a data set manually.

Configure Pseudonymizer

To use the Pseudonymizer, configure both the fields you want to anonymize, and the location to store the scrubbed data:

  1. Create a manifest file: This file describes the fields to include or pseudonymize.
    • Default manifest - GitLab provides a default manifest in your GitLab installation (example manifest.yml file). To use the example manifest file, use the config/pseudonymizer.yml relative path when you configure connection parameters.
    • Custom manifest - To use a custom manifest file, use the absolute path to the file when you configure the connection parameters.
  2. Configure connection parameters: In the configuration method appropriate for your version of GitLab, specify the object storage connection parameters (pseudonymizer.upload.connection).

For Omnibus installations:

  1. Edit /etc/gitlab/gitlab.rb and add the following lines by replacing with the values you want:

    gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
    gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
    gitlab_rails['pseudonymizer_upload_connection'] = {
      'provider' => 'AWS',
      'region' => 'eu-central-1',
      'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
      'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
    }
    

    If you are using AWS IAM profiles, omit the AWS access key and secret access key/value pairs.

    gitlab_rails['pseudonymizer_upload_connection'] = {
      'provider' => 'AWS',
      'region' => 'eu-central-1',
      'use_iam_profile' => true
    }
    
  2. Save the file and reconfigure GitLab for the changes to take effect.


For installations from source:

  1. Edit /home/git/gitlab/config/gitlab.yml and add or amend the following lines:

    pseudonymizer:
      manifest: config/pseudonymizer.yml
      upload:
        remote_directory: 'gitlab-elt' # bucket name
        connection:
          provider: AWS
          aws_access_key_id: AWS_ACCESS_KEY_ID
          aws_secret_access_key: AWS_SECRET_ACCESS_KEY
          region: eu-central-1
    
  2. Save the file and restart GitLab for the changes to take effect.

Enable Pseudonymizer data collection

To enable data collection:

  1. On the top bar, select Menu > Admin.
  2. On the left sidebar, select Settings > Metrics and Profiling, then expand Pseudonymizer data collection.
  3. Select Enable Pseudonymizer data collection.
  4. Select Save changes.

Generate data set manually

You can also run the Pseudonymizer manually:

  1. Set these environment variables:
    • PSEUDONYMIZER_OUTPUT_DIR - Where to store the output CSV files. Defaults to /tmp. These commands produce CSV files that can be quite large. Make sure the directory can store a file at least 10% of the size of your database.
    • PSEUDONYMIZER_BATCH - The batch size when querying the database. Defaults to 100000.
  2. Run the command appropriate for your application:
    • Omnibus GitLab: sudo gitlab-rake gitlab:db:pseudonymizer
    • Installations from source: sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production

After you run the command, upload the output CSV files to your configured object storage. After the upload completes, delete the output file from the local disk.