gitlab-org--gitlab-foss/doc/development/background_migrations.md

20 KiB

type stage group info
reference, dev none Development See the Technical Writers assigned to Development Guidelines: https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments-to-development-guidelines

Background migrations

Background migrations should be used to perform data migrations whenever a migration exceeds the time limits in our guidelines. For example, you can use background migrations to migrate data that's stored in a single JSON column to a separate table instead.

If the database cluster is considered to be in an unhealthy state, background migrations automatically reschedule themselves for a later point in time.

When To Use Background Migrations

You should use a background migration when you migrate data in tables that have so many rows that the process would exceed the time limits in our guidelines if performed using a regular Rails migration.

  • Background migrations should be used when migrating data in high-traffic tables.
  • Background migrations may also be used when executing numerous single-row queries for every item on a large dataset. Typically, for single-record patterns, runtime is largely dependent on the size of the dataset, hence it should be split accordingly and put into background migrations.
  • Background migrations should not be used to perform schema migrations.

Some examples where background migrations can be useful:

  • Migrating events from one table to multiple separate tables.
  • Populating one column based on JSON stored in another column.
  • Migrating data that depends on the output of external services (for example, an API).

NOTE: If the background migration is part of an important upgrade, make sure it's announced in the release post. Discuss with your Project Manager if you're not sure the migration falls into this category.

Isolation

Background migrations must be isolated and can not use application code (for example, models defined in app/models). Since these migrations can take a long time to run it's possible for new versions to be deployed while they are still running.

It's also possible for different migrations to be executed at the same time. This means that different background migrations should not migrate data in a way that would cause conflicts.

Idempotence

Background migrations are executed in a context of a Sidekiq process. Usual Sidekiq rules apply, especially the rule that jobs should be small and idempotent.

See Sidekiq best practices guidelines for more details.

Make sure that in case that your migration job is going to be retried data integrity is guaranteed.

Background migrations for EE-only features

All the background migration classes for EE-only features should be present in GitLab CE. For this purpose, an empty class can be created for GitLab CE, and it can be extended for GitLab EE as explained in the guidelines for implementing Enterprise Edition features.

How It Works

Background migrations are simple classes that define a perform method. A Sidekiq worker will then execute such a class, passing any arguments to it. All migration classes must be defined in the namespace Gitlab::BackgroundMigration, the files should be placed in the directory lib/gitlab/background_migration/.

Scheduling

Scheduling a background migration should be done in a post-deployment migration that includes Gitlab::Database::MigrationHelpers To do so, simply use the following code while replacing the class name and arguments with whatever values are necessary for your migration:

migrate_in('BackgroundMigrationClassName', [arg1, arg2, ...])

You can use the function queue_background_migration_jobs_by_range_at_intervals to automatically split the job into batches:

queue_background_migration_jobs_by_range_at_intervals(
  ClassName,
  BackgroundMigrationClassName,
  2.minutes,
  batch_size: 10_000
  )

You'll also need to make sure that newly created data is either migrated, or saved in both the old and new version upon creation. For complex and time consuming migrations it's best to schedule a background job using an after_create hook so this doesn't affect response timings. The same applies to updates. Removals in turn can be handled by simply defining foreign keys with cascading deletes.

Rescheduling background migrations

If one of the background migrations contains a bug that is fixed in a patch release, the background migration needs to be rescheduled so the migration would be repeated on systems that already performed the initial migration.

When you reschedule the background migration, make sure to turn the original scheduling into a no-op by clearing up the #up and #down methods of the migration performing the scheduling. Otherwise the background migration would be scheduled multiple times on systems that are upgrading multiple patch releases at once.

When you start the second post-deployment migration, you should delete any previously queued jobs from the initial migration with the provided helper:

delete_queued_jobs('BackgroundMigrationClassName')

Cleaning Up

NOTE: Cleaning up any remaining background migrations must be done in either a major or minor release, you must not do this in a patch release.

Because background migrations can take a long time you can't immediately clean things up after scheduling them. For example, you can't drop a column that's used in the migration process as this would cause jobs to fail. This means that you'll need to add a separate post deployment migration in a future release that finishes any remaining jobs before cleaning things up (for example, removing a column).

As an example, say you want to migrate the data from column foo (containing a big JSON blob) to column bar (containing a string). The process for this would roughly be as follows:

  1. Release A:
    1. Create a migration class that performs the migration for a row with a given ID. You can use background jobs tracking to simplify cleaning up.
    2. Deploy the code for this release, this should include some code that will schedule jobs for newly created data (for example, using an after_create hook).
    3. Schedule jobs for all existing rows in a post-deployment migration. It's possible some newly created rows may be scheduled twice so your migration should take care of this.
  2. Release B:
    1. Deploy code so that the application starts using the new column and stops scheduling jobs for newly created data.
    2. In a post-deployment migration, finalize all jobs that have not succeeded by now. If you used background jobs tracking in release A, you can use finalize_background_migration from BackgroundMigrationHelpers to ensure no jobs remain. This helper will:
      1. Use Gitlab::BackgroundMigration.steal to process any remaining jobs in Sidekiq.
      2. Reschedule the migration to be run directly (that is, not through Sidekiq) on any rows that weren't migrated by Sidekiq. This can happen if, for instance, Sidekiq received a SIGKILL, or if a particular batch failed enough times to be marked as dead.
      3. Remove Gitlab::Database::BackgroundMigrationJob rows where status = succeeded. To retain diagnostic information that may help with future bug tracking you can skip this step by specifying the delete_tracking_jobs: false parameter.
    3. Remove the old column.

This may also require a bump to the import/export version, if importing a project from a prior version of GitLab requires the data to be in the new format.

Example

To explain all this, let's use the following example: the table integrations has a field called properties which is stored in JSON. For all rows you want to extract the url key from this JSON object and store it in the integrations.url column. There are millions of integrations and parsing JSON is slow, thus you can't do this in a regular migration.

To do this using a background migration we'll start with defining our migration class:

class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
  class Integration < ActiveRecord::Base
    self.table_name = 'integrations'
  end

  def perform(start_id, end_id)
    Integration.where(id: start_id..end_id).each do |integration|
      json = JSON.load(integration.properties)

      integration.update(url: json['url']) if json['url']
    rescue JSON::ParserError
      # If the JSON is invalid we don't want to keep the job around forever,
      # instead we'll just leave the "url" field to whatever the default value
      # is.
      next
    end
  end
end

Next we'll need to adjust our code so we schedule the above migration for newly created and updated integrations. We can do this using something along the lines of the following:

class Integration < ActiveRecord::Base
  after_commit :schedule_integration_migration, on: :update
  after_commit :schedule_integration_migration, on: :create

  def schedule_integration_migration
    BackgroundMigrationWorker.perform_async('ExtractIntegrationsUrl', [id, id])
  end
end

We're using after_commit here to ensure the Sidekiq job is not scheduled before the transaction completes as doing so can lead to race conditions where the changes are not yet visible to the worker.

Next we'll need a post-deployment migration that schedules the migration for existing data.

class ScheduleExtractIntegrationsUrl < Gitlab::Database::Migration[1.0]
  disable_ddl_transaction!

  MIGRATION = 'ExtractIntegrationsUrl'
  DELAY_INTERVAL = 2.minutes

  def up
    queue_background_migration_jobs_by_range_at_intervals(
      define_batchable_model('integrations'),
      MIGRATION,
      DELAY_INTERVAL)
  end

  def down
  end
end

Once deployed our application will continue using the data as before but at the same time will ensure that both existing and new data is migrated.

In the next release we can remove the after_commit hooks and related code. We will also need to add a post-deployment migration that consumes any remaining jobs and manually run on any un-migrated rows. Such a migration would look like this:

class ConsumeRemainingExtractIntegrationsUrlJobs < Gitlab::Database::Migration[1.0]
  disable_ddl_transaction!

  def up
    # This must be included
    Gitlab::BackgroundMigration.steal('ExtractIntegrationsUrl')

    # This should be included, but can be skipped - see below
    define_batchable_model('integrations').where(url: nil).each_batch(of: 50) do |batch|
      range = batch.pluck('MIN(id)', 'MAX(id)').first

      Gitlab::BackgroundMigration::ExtractIntegrationsUrl.new.perform(*range)
    end
  end

  def down
  end
end

The final step runs for any un-migrated rows after all of the jobs have been processed. This is in case a Sidekiq process running the background migrations received SIGKILL, leading to the jobs being lost. (See more reliable Sidekiq queue for more information.)

If the application does not depend on the data being 100% migrated (for instance, the data is advisory, and not mission-critical), then this final step can be skipped.

This migration will then process any jobs for the ExtractIntegrationsUrl migration and continue once all jobs have been processed. Once done you can safely remove the integrations.properties column.

Testing

It is required to write tests for:

  • The background migrations' scheduling migration.
  • The background migration itself.
  • A cleanup migration.

The :migration and schema: :latest RSpec tags are automatically set for background migration specs. See the Testing Rails migrations style guide.

Keep in mind that before and after RSpec hooks are going to migrate you database down and up, which can result in other background migrations being called. That means that using spy test doubles with have_received is encouraged, instead of using regular test doubles, because your expectations defined in a it block can conflict with what is being called in RSpec hooks. See issue #35351 for more details.

Best practices

  1. Make sure to know how much data you're dealing with.

  2. Make sure that background migration jobs are idempotent.

  3. Make sure that tests you write are not false positives.

  4. Make sure that if the data being migrated is critical and cannot be lost, the clean-up migration also checks the final state of the data before completing.

  5. When migrating many columns, make sure it won't generate too many dead tuples in the process (you may need to directly query the number of dead tuples and adjust the scheduling according to this piece of data).

  6. Make sure to discuss the numbers with a database specialist, the migration may add more pressure on DB than you expect (measure on staging, or ask someone to measure on production).

  7. Make sure to know how much time it'll take to run all scheduled migrations.

  8. Provide an estimation section in the description, estimating both the total migration run time and the query times for each background migration job. Explain plans for each query should also be provided.

    For example, assuming a migration that deletes data, include information similar to the following section:

    Background Migration Details:
    
    47600 items to delete
    batch size = 1000
    47600 / 1000 = 48 batches
    
    Estimated times per batch:
    - 820ms for select statement with 1000 items (see linked explain plan)
    - 900ms for delete statement with 1000 items (see linked explain plan)
    Total: ~2 sec per batch
    
    2 mins delay per batch (safe for the given total time per batch)
    
    48 batches * 2 min per batch = 96 mins to run all the scheduled jobs
    

    The execution time per batch (2 sec in this example) is not included in the calculation for total migration time. The jobs are scheduled 2 minutes apart without knowledge of the execution time.

Additional tips and strategies

Nested batching

A strategy to make the migration run faster is to schedule larger batches, and then use EachBatch within the background migration to perform multiple statements.

The background migration helpers that queue multiple jobs such as queue_background_migration_jobs_by_range_at_intervals use EachBatch. The example above has batches of 1000, where each queued job takes two seconds. If the query has been optimized to make the time for the delete statement within the query performance guidelines, 1000 may be the largest number of records that can be deleted in a reasonable amount of time.

The minimum and most common interval for delaying jobs is two minutes. This results in two seconds of work for each two minute job. There's nothing that prevents you from executing multiple delete statements in each background migration job.

Looking at the example above, you could alternatively do:

Background Migration Details:

47600 items to delete
batch size = 10_000
47600 / 10_000 = 5 batches

Estimated times per batch:
- Records are updated in sub-batches of 1000 => 10_000 / 1000 = 10 total updates
- 820ms for select statement with 1000 items (see linked explain plan)
- 900ms for delete statement with 1000 items (see linked explain plan)
Sub-batch total: ~2 sec per sub-batch,
Total batch time: 2 * 10 = 20 sec per batch

2 mins delay per batch

5 batches * 2 min per batch = 10 mins to run all the scheduled jobs

The batch time of 20 seconds still fits comfortably within the two minute delay, yet the total run time is cut by a tenth from around 100 minutes to 10 minutes! When dealing with large background migrations, this can cut the total migration time by days.

When batching in this way, it is important to look at query times on the higher end of the table or relation being updated. EachBatch may generate some queries that become much slower when dealing with higher ID ranges.

Delay time

When looking at the batch execution time versus the delay time, the execution time should fit comfortably within the delay time for a few reasons:

  • To allow for a variance in query times.
  • To allow autovacuum to catch up after periods of high churn.

Never try to optimize by fully filling the delay window even if you are confident the queries themselves have no timing variance.

Background jobs tracking

queue_background_migration_jobs_by_range_at_intervals can create records for each job that is scheduled to run. You can enable this behavior by passing track_jobs: true. Each record starts with a pending status. Make sure that your worker updates the job status to succeeded by calling Gitlab::Database::BackgroundMigrationJob.mark_all_as_succeeded in the perform method of your background migration.

# Background migration code

def perform(start_id, end_id)
  # do work here

  mark_job_as_succeeded(start_id, end_id)
end

private

# Make sure that the arguments passed here match those passed to the background
# migration
def mark_job_as_succeeded(*arguments)
 Gitlab::Database::BackgroundMigrationJob.mark_all_as_succeeded(
    self.class.name.demodulize,
    arguments
  )
end
# Post deployment migration
MIGRATION = 'YourBackgroundMigrationName'
DELAY_INTERVAL = 2.minutes.to_i # can be different
BATCH_SIZE = 10_000 # can be different

disable_ddl_transaction!

def up
  queue_background_migration_jobs_by_range_at_intervals(
    define_batchable_model('name_of_the_table_backing_the_model'),
    MIGRATION,
    DELAY_INTERVAL,
    batch_size: BATCH_SIZE,
    track_jobs: true
  )
end

def down
  # no-op
end

See lib/gitlab/background_migration/drop_invalid_vulnerabilities.rb for a full example.

Rescheduling pending jobs

You can reschedule pending migrations from the background_migration_jobs table by creating a post-deployment migration and calling requeue_background_migration_jobs_by_range_at_intervals with the migration name and delay interval.

# Post deployment migration
MIGRATION = 'YourBackgroundMigrationName'
DELAY_INTERVAL = 2.minutes

disable_ddl_transaction!

def up
  requeue_background_migration_jobs_by_range_at_intervals(MIGRATION, DELAY_INTERVAL)
end

def down
  # no-op
end

See db/post_migrate/20210604070207_retry_backfill_traversal_ids.rb for a full example.

Viewing failure error logs

After running a background migration, if any jobs have failed, you can view the logs in Kibana. View the production Sidekiq log and filter for:

  • json.class: BackgroundMigrationWorker
  • json.job_status: fail
  • json.meta.caller_id: <MyBackgroundMigrationSchedulingMigrationClassName>
  • json.args: <MyBackgroundMigrationClassName>

Looking at the json.error_class, json.error_message and json.error_backtrace values may be helpful in understanding why the jobs failed.

Depending on when and how the failure occurred, you may find other helpful information by filtering with json.class: <MyBackgroundMigrationClassName>.