<Marc Qualie/>

Rails Data Migrations

Migrating a schema when deploying code is a largely solved problem in Rails, and in many other languages and frameworks. What hasn't quite been solved, or at least agreed upon, is how to make changes to data based on those migrations. I'm not going to cover the dos and don'ts or the pros and cons of each method, as many people before me have covered this in great detail. What I do want to do, however, is share a simple solution I created to simplify this process while making the most of previous methods.

What I came up with is a very simplified solution, combined from my favorite findings around the internet. First of all, I agree that data migrations should never be run at the time of schema migration. The links I've included cover this in a lot of detail. The TL;DR is that you'll lock up your database, cause downtime and learn a valuable career lesson.

Let's start with a basic migration. Nothing fancy, we just want to keep a copy of the timestamp a user last performed something. This is very helpful for being able to filter and sort users based on this field, without joins and crazy nested ActiveModel helpers. I use this technique often for fields that are queried extremely often, but join on a table with millions of rows and frequent writes. Do the work once at write-time, then read-time performance is greatly improved.

class AddLastSmiledAtColumnToUsers < ActiveRecord::Migration[5.1]
  def change
    add_column :users, :last_smiled_at, :datetime
    add_index :users, :last_smiled_at
  end
end

Now that we have this column, we want to populate all users with the most recent value from the join table so that our app is usable. The naive instinct (I was guilty of this a few years ago) is to just pop a loop inside the change method to update all of the users. As I mentioned, the linked articles cover this in detail, but the summary is that you'll end up with downtime on even small datasets.

You want the data migration to be separate, but ideally also linked for reference. A gem called migration_data attempts to solve this by adding a data method. This keeps the code separate from the change method, but since it's run at the same time as the schema migrations, it is still prone to the downtime problem.

My solution is very similar, but I actually embed a whole extra class onto the migration object to separate it even further.

class AddLastSmiledAtColumnToUsers < ActiveRecord::Migration[5.1]
  def change
    add_column :users, :last_smiled_at, :datetime
    add_index :users, :last_smiled_at
  end

  class Data
    def up
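      # Load users in batches of 250 to keep memory usage bounded
      User.find_in_batches(batch_size: 250) do |group|
        # One transaction per batch, rather than one per save
        ActiveRecord::Base.transaction do
          group.each do |user|
            # Safe navigation: users with no smiles keep a nil timestamp
            user.last_smiled_at = user.smiles.last&.created_at
            user.save if user.changed?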
          end
        end
      end
    end
  end
end

While this doesn't look obviously different, it has a few advantages. It is not run as part of the schema migrations, nor is it part of the actual migration class. It's a plain Ruby class, so writing a test for it is extremely easy. You can detect N+1 queries using bullet, ensure the data is correct post-migration and much more without even initializing the ActiveRecord::Migration base. On top of all that, it's still tightly linked to the migration for reference and code review.

A final reason I chose this over the data method is that I have a feeling ActiveRecord core will attempt to handle data migrations at some point in the future, and that method name may end up being used by the framework. Keeping our code outside of the core class avoids that clash.

Now that we can write, test, commit and review our data migration code, we need to be able to run it. I avoid including gems as much as possible when I can achieve the same result with a few lines of (sometimes hacky) code, so I have a quick two-liner that is responsible for running this data migration.

Dir.glob("#{Rails.root}/db/migrate/*.rb").each { |file| require file }
AddLastSmiledAtColumnToUsers::Data.new.up

There are plenty of clever ways to only include the file for the migration you are interested in, but if migrations are written correctly then there's little to no overhead in including them all, as they only define class constants. You're probably going to be running this in a one-off dyno or equivalent, so the memory is thrown away afterwards anyway.
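As a rough illustration of loading just one file, here is a minimal sketch. It assumes the standard Rails naming convention, where a migration class such as AddLastSmiledAtColumnToUsers lives in a file named `<timestamp>_add_last_smiled_at_column_to_users.rb`. The helpers `underscore` and `migration_file_for` are hypothetical names made up for this example, not part of Rails.

```ruby
# Hypothetical helper: convert a CamelCase class name to snake_case.
# (Inside a Rails app, ActiveSupport's String#underscore does this already.)
def underscore(name)
  name.gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')
      .gsub(/([a-z\d])([A-Z])/, '\1_\2')
      .downcase
end

# Hypothetical helper: find the migration file that defines class_name.
# Assumes files are named <timestamp>_<snake_case_class_name>.rb.
# Returns nil when no file matches.
def migration_file_for(class_name, migrate_dir)
  pattern = File.join(migrate_dir, "*_#{underscore(class_name)}.rb")
  Dir.glob(pattern).first
end
```

In an app, you would pass "#{Rails.root}/db/migrate" as the directory, require the returned path, and then call AddLastSmiledAtColumnToUsers::Data.new.up as before.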

Another way you could simplify running this, and even automate it, is to use Active Job and have a DataMigrationRunnerJob. Ideally you would want to keep track of which data migrations have run (similar to how schema migrations are tracked) and implement detailed logging in case something goes wrong.
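A minimal sketch of what the tracking and logging part could look like, in plain Ruby so it can be tried outside a Rails app. Everything here is an assumption for illustration: `DataMigrationRunner` is a hypothetical class, and the in-memory Set stands in for whatever persistent store (a table, a file) a real app would use. A job's perform method could simply delegate to `run`.

```ruby
require "logger"
require "set"

# Hypothetical runner: executes each migration's nested Data class once,
# remembering by class name which ones have already completed.
class DataMigrationRunner
  attr_reader :completed

  def initialize(completed: Set.new, logger: Logger.new($stdout))
    @completed = completed
    @logger = logger
  end

  # migrations: classes that may define a nested Data class responding to #up.
  def run(migrations)
    migrations.each do |migration|
      # inherit = false, so Ruby 3.2's top-level ::Data is not picked up
      next unless migration.const_defined?(:Data, false)
      next if @completed.include?(migration.name)

      begin
        @logger.info("Running data migration for #{migration.name}")
        migration.const_get(:Data, false).new.up
        @completed << migration.name
      rescue StandardError => e
        # Log and re-raise so later migrations don't run after a failure
        @logger.error("#{migration.name} failed: #{e.class}: #{e.message}")
        raise
      end
    end
  end
end
```

A failed migration is not recorded as completed, so it will be attempted again on the next run.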

This has been my personal preference for managing data migrations because it doesn't require any gems, you are in full control of how and when the data script is run, and the code is committed alongside your schema migration code.

After writing this, I came across the data-migrate rubygem, which seems like a very well put together solution for huge applications. My personal preference still remains, but if you're a gem collector or want a full end-to-end solution with rake tasks and tests, then maybe data-migrate is for you.

Feel free to share your own solutions and/or offer feedback on what I've come up with on Twitter or in the comments below.

Resources

If you have any questions about this post, or anything else, you can get in touch on Bluesky or browse my code on Github.