Migrating a schema when deploying code is a fairly solved problem in Rails and many other languages/frameworks. What hasn't quite been solved, or at least agreed upon, is how to make changes to data based on those migrations. I'm not going to cover the dos, don'ts, pros, and cons of each method, as many people before me have covered this in great detail. What I do want to do, however, is share a nice simple solution I created that streamlines the process and makes the most of those previous methods.
What I came up with is a simplified solution combining my favorite findings from around the internet. First of all, I agree that data migrations should never be run at the time of schema migration. The links I've included cover this in a lot of detail. The TL;DR of it all is that you'll lock up your database, cause downtime, and learn a valuable career lesson.
Let's start with a basic migration. Nothing fancy; we just want to keep a copy of the timestamp at which a user last performed something. This is very helpful for filtering and sorting users on that field without joins and crazy nested ActiveModel helpers. I use this technique often for fields that are queried extremely often but would otherwise require a join against a table with millions of rows and frequent writes. Do the work once at write time, and read-time performance is greatly improved.
class AddLastSmiledAtColumnToUsers < ActiveRecord::Migration[5.1]
  def change
    add_column :users, :last_smiled_at, :datetime
    add_index :users, :last_smiled_at
  end
end
Now that we have this column, what we want to do is populate it for all users with the most recent value from the join table so that our app is usable. The naive instinct (I was guilty of this a few years ago) is to just pop a loop inside the change method to update all of the users. As I mentioned, the linked articles cover this in detail, but the summary is that you'll end up with downtime on even small datasets.
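For illustration, the naive version looks something like this (assuming a smiles association holds the join-table records; don't ship this):

class AddLastSmiledAtColumnToUsers < ActiveRecord::Migration[5.1]
  def change
    add_column :users, :last_smiled_at, :datetime
    add_index :users, :last_smiled_at

    # Anti-pattern: backfilling every row inside the schema migration,
    # while the migration still holds locks on the users table
    User.find_each do |user|
      user.update(last_smiled_at: user.smiles.last&.created_at)
    end
  end
end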
You want the data migration to be separate, but ideally still linked to the schema migration for reference. A gem called migration_data attempts to solve this by adding a data method. This keeps the code out of the change method, but since it's run at the same time as the schema migrations, it is still prone to the downtime problem.
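If I recall its README correctly, usage looks roughly like this (a sketch, not verified against the gem's current API):

class AddLastSmiledAtColumnToUsers < ActiveRecord::Migration[5.1]
  def change
    add_column :users, :last_smiled_at, :datetime
    add_index :users, :last_smiled_at
  end

  # migration_data calls this when the schema migration runs
  def data
    User.find_each do |user|
      user.update(last_smiled_at: user.smiles.last&.created_at)
    end
  end
end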
My solution is very similar, but I nest a whole extra class inside the migration to separate it even further.
class AddLastSmiledAtColumnToUsers < ActiveRecord::Migration[5.1]
  def change
    add_column :users, :last_smiled_at, :datetime
    add_index :users, :last_smiled_at
  end

  class Data
    def up
      User.find_in_batches(batch_size: 250) do |group|
        ActiveRecord::Base.transaction do
          group.each do |user|
            # Safe navigation guards against users who have no smiles yet
            user.last_smiled_at = user.smiles.last&.created_at
            user.save if user.changed?
          end
        end
      end
    end
  end
end
While this doesn't appear obviously different, it has a few advantages. It is not run as part of the schema migrations, nor is it part of the actual migration class. It's a raw Ruby class, so writing a test for it is extremely easy. You can detect N+1 queries using bullet, verify the data is correct post-migration, and much more, without ever initializing the ActiveRecord::Migration base class. All of that, and it's still tightly linked to the migration for reference and code review.
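As a rough sketch, a spec could look like this (assuming RSpec and FactoryBot with user and smile factories; the filename timestamp is illustrative):

require "rails_helper"
require Rails.root.join("db/migrate/20170101000000_add_last_smiled_at_column_to_users")

RSpec.describe AddLastSmiledAtColumnToUsers::Data do
  it "backfills last_smiled_at from the most recent smile" do
    user  = create(:user, last_smiled_at: nil)
    smile = create(:smile, user: user)

    described_class.new.up

    expect(user.reload.last_smiled_at).to be_within(1.second).of(smile.created_at)
  end

  it "leaves users without smiles untouched" do
    user = create(:user, last_smiled_at: nil)

    described_class.new.up

    expect(user.reload.last_smiled_at).to be_nil
  end
end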
A final reason I chose this over the data method: I have a feeling that at some point in the future ActiveRecord core will attempt to handle data migrations itself, and that method name may come into use. Keeping our code outside of the migration class itself prevents a future clash.
Now that we can write, test, commit, and review our data migration code, we need to be able to run it. I avoid adding gems whenever I can achieve the same result with a few lines of (sometimes hacky) code, so I have a quick two-liner that is responsible for running this data migration.
# Load every migration file so the nested Data classes are defined
Dir.glob("#{Rails.root}/db/migrate/*.rb").each { |file| require file }
AddLastSmiledAtColumnToUsers::Data.new.up
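If you'd rather not paste that into a console, the same two lines wrap neatly into a rake task. Here's a sketch (the task name and file location, e.g. lib/tasks/data_migrations.rake, are my own invention):

namespace :data do
  desc "Backfill users.last_smiled_at from their most recent smile"
  task backfill_last_smiled_at: :environment do
    Dir.glob("#{Rails.root}/db/migrate/*.rb").each { |file| require file }
    AddLastSmiledAtColumnToUsers::Data.new.up
  end
end

You'd then run it with bin/rails data:backfill_last_smiled_at.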
There are plenty of clever ways to require only the file for the migration you're interested in, but if migrations are written correctly there's little to no overhead in requiring them all, as they only define class constants. You're probably going to be running this in a one-off dyno or equivalent, so the memory is thrown away afterwards anyway.
Another way you could simplify, and even automate, the running of this is to use active_job and have a DataMigrationRunnerJob. Ideally you would want to keep track of which data migrations have run (similar to how schema migrations are tracked) and implement detailed logging in case something goes wrong.
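A minimal sketch of what that could look like, assuming a hypothetical DataMigration model (backed by a data_migrations table with a name column) for tracking completed runs:

class DataMigrationRunnerJob < ApplicationJob
  queue_as :default

  def perform(migration_class_name)
    # Load the migration files so the nested Data classes are defined
    Dir.glob("#{Rails.root}/db/migrate/*.rb").each { |file| require file }

    # DataMigration is a hypothetical model used to record completed runs
    return if DataMigration.exists?(name: migration_class_name)

    Rails.logger.info("Running data migration #{migration_class_name}")
    migration_class_name.constantize::Data.new.up
    DataMigration.create!(name: migration_class_name)
  rescue StandardError => e
    Rails.logger.error("Data migration #{migration_class_name} failed: #{e.message}")
    raise
  end
end

You could then enqueue it after a deploy with DataMigrationRunnerJob.perform_later("AddLastSmiledAtColumnToUsers").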
This has been my personal preference for managing data migrations because it doesn't require any gems, you are in full control of how and when the data script is run, and the code is committed alongside your schema migration code.
After writing this, I came across the data-migrate rubygem, which seems like a very well-put-together solution for huge applications. My personal preference still stands, but if you're a gem collector or want a full end-to-end solution with rake tasks and tests, then maybe data-migrate is for you.
Feel free to share your own solutions and/or offer feedback on what I've come up with on Twitter or in the comments below.
Resources
- Helpful articles about Rails data migration