Let’s talk about keeping your application running smoothly while you update it. It’s a common challenge. You need to fix a bug, add a feature, or update a library, but you can’t afford to have your website go down or throw errors for users. The good news is, with careful planning, you can update almost every part of a live Ruby on Rails application without your users noticing. I want to share with you seven practical ways to do this.
The first step is often the trickiest: changing the database. A simple migration that adds a column to a table with millions of records can lock the table for minutes. Your site grinds to a halt. I’ve learned to break these operations into safe, backward-compatible steps. Instead of adding a column with a default value in one go, which on older databases (PostgreSQL before version 11, for example, or MySQL without instant DDL) forces the database to rewrite every row immediately, you do it in stages.
You add the column without any constraints first. Then, in small batches, you fill in the values for existing records. Finally, you add the NOT NULL constraint only after all the data is in place. This way, the database never gets overloaded. The same idea applies to renaming a column. You don’t just rename it. You add a new column, write code that updates both columns, deploy that code, then slowly move all data over before removing the old one. It sounds like more work, but it prevents headaches.
# A safer way to add a required column to a large table
def add_column_safely(table, new_column, type)
  conn = ActiveRecord::Base.connection

  # Step 1: Add the column, allowing NULL values for now
  conn.execute(
    "ALTER TABLE #{conn.quote_table_name(table)} " \
    "ADD COLUMN #{conn.quote_column_name(new_column)} #{type}"
  )

  # Step 2: Backfill existing records in manageable batches.
  # Derive the model from the table name (e.g. "users" -> User).
  model = table.to_s.classify.constantize
  model.in_batches(of: 1000) do |relation|
    relation.update_all(new_column => 'temporary_default_value')
  end

  # Step 3: Now that every row has a value, make the column required
  conn.execute(
    "ALTER TABLE #{conn.quote_table_name(table)} " \
    "ALTER COLUMN #{conn.quote_column_name(new_column)} SET NOT NULL"
  )
end
My second go-to technique is using feature flags. This is one of the most powerful tools in my toolkit. Think of a feature flag as a light switch for a piece of code. You can deploy the code for a new dashboard with the switch turned off. The code is there, but no one sees it. Then, when you’re ready, you can turn it on just for yourself, your team, or 5% of your users to test it.
This lets you separate deployment from release. You can ship code on a Tuesday afternoon without stressing, because it’s not active. You can then activate it on a Wednesday morning when everyone is fresh. If something goes wrong, you flip the switch off. The rollback is instantaneous. I use a simple class that checks against a user’s ID or a percentage to control who sees what.
# A simple feature flag check in a controller
class ProjectsController < ApplicationController
  def show
    if Feature.enabled?(:new_project_ui, current_user)
      render :new_show
    else
      render :old_show
    end
  end
end

# The flag logic
require 'digest'

class Feature
  def self.enabled?(flag_name, user)
    # Internal team members always see new features
    # (internal_team_user? is defined elsewhere in the app)
    return true if internal_team_user?(user)

    # Otherwise, roll out to 10% of users. Hashing the flag name together
    # with the user ID gives each flag its own stable cohort.
    user_hash = Digest::MD5.hexdigest("#{flag_name}:#{user.id}").to_i(16)
    user_hash % 100 < 10 # Enable for 10% of users
  end
end
Third, we need to handle ongoing requests gracefully when we restart the application server. This is called connection draining. When you tell your server to restart, it shouldn’t just cut everyone off. It should stop accepting new connections and wait a short time for existing requests to finish. A small piece of middleware can help with this.
This middleware checks if we are in a “draining” state. If we are, it immediately returns a polite 503 Service Unavailable status to new requests, telling the client to try again soon. Existing requests continue to be processed. After a set timeout, we can safely restart the server knowing no users were interrupted mid-action.
# Rack middleware to stop accepting new requests
class DrainMiddleware
  def initialize(app)
    @app = app
    @draining = false
  end

  def call(env)
    # If we're draining, send a 'go away' response with a retry hint
    if @draining
      return [503,
              { 'Content-Type' => 'text/plain', 'Retry-After' => '10' },
              ['Deployment in progress']]
    end

    # Otherwise, process the request normally
    @app.call(env)
  end

  def start_draining!
    @draining = true
  end
end
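Rejecting new requests is only half the job: the deploy process also needs to know when the in-flight requests have actually finished. One way to track that is a second piece of middleware that counts active requests. This is a minimal sketch; `InFlightTracker` and `wait_for_drain` are names I'm inventing for illustration, not part of Rack.

```ruby
# Counts requests currently being processed, so a deploy script can
# wait until the count reaches zero before restarting the server.
class InFlightTracker
  @count = 0
  @lock = Mutex.new

  class << self
    attr_reader :count

    def increment
      @lock.synchronize { @count += 1 }
    end

    def decrement
      @lock.synchronize { @count -= 1 }
    end

    # Block until all in-flight requests complete or the timeout expires;
    # returns true if the server fully drained in time.
    def wait_for_drain(timeout: 30)
      deadline = Time.now + timeout
      sleep 0.1 while @count > 0 && Time.now < deadline
      @count.zero?
    end
  end

  def initialize(app)
    @app = app
  end

  def call(env)
    self.class.increment
    @app.call(env)
  ensure
    self.class.decrement
  end
end
```

A deploy script would then call `start_draining!` on the drain middleware, wait on `InFlightTracker.wait_for_drain`, and only then restart the process.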
The fourth pattern is the canary release, named after the old mining practice. You release your new code to a very small, controlled subset of your infrastructure or users first—your “canary.” You then watch this group closely for any signs of trouble: increased error rates, slower response times, or problems with business metrics.
If the canary stays healthy, you gradually increase the traffic to the new version. If it gets sick, you immediately redirect traffic back to the stable version. This automated, metrics-based approach lets you catch problems before they affect everyone. I set up monitors to track error rates and latency, and define clear thresholds for failure.
# Pseudo-code for a canary health check
def canary_healthy?(new_server_pool)
  # Measure error rate on the new servers
  error_rate = monitoring_tool.error_rate(new_server_pool)
  return false if error_rate > 0.01 # More than 1% errors is bad

  # Measure p99 response time, in milliseconds
  p99_latency = monitoring_tool.latency(new_server_pool)
  return false if p99_latency > 500 # Slower than 500ms is too slow

  # If all checks pass, the canary is healthy
  true
end
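The gradual traffic increase can then be sketched as a loop around that health check. This is illustrative only: `set_traffic` and `healthy` are stand-ins for whatever your load balancer and monitoring tool actually expose, not a real gem's API.

```ruby
# Gradually shift traffic to the canary, backing out at the first sign
# of trouble. Both callables are injected so the logic stays testable.
def progressive_rollout(steps: [5, 25, 50, 100], set_traffic:, healthy:)
  steps.each do |percent|
    set_traffic.call(percent)
    unless healthy.call
      # Canary is sick: send everyone back to the stable version
      set_traffic.call(0)
      return :rolled_back
    end
  end
  :complete
end
```

In a real setup you would also pause between steps to let enough traffic accumulate for the health metrics to be meaningful.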
Fifth is the blue-green deployment strategy. This requires a bit more infrastructure but gives you incredible confidence. You have two identical production environments: “Blue” and “Green.” Only one is live at a time. Let’s say Blue is live. You deploy your new application version to the idle Green environment. You run your database migrations there, warm up its caches, and run a suite of smoke tests.
Once Green is verified and ready, you switch your load balancer’s configuration. All new user traffic goes to Green. Blue is now idle. If anything goes wrong, you switch back to Blue instantly. This switch is nearly instantaneous for users. After confirming Green is stable, Blue becomes your staging area for the next deployment.
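The bookkeeping behind that switch can be modeled in a few lines. This is a sketch under the assumption that the actual load balancer update happens in a callback; `BlueGreenDeployer` and its methods are illustrative names, not a library API.

```ruby
# Tracks which environment is live and flips between them.
# The on_switch callback is where the real load balancer would be
# updated (an assumed integration point, not a specific product's API).
class BlueGreenDeployer
  attr_reader :live

  def initialize(on_switch: ->(env) {})
    @live = :blue
    @on_switch = on_switch
  end

  def idle
    @live == :blue ? :green : :blue
  end

  # Deploy to the idle environment, verify it, then cut traffic over
  def deploy!(smoke_test:)
    target = idle
    raise "Smoke tests failed on #{target}" unless smoke_test.call(target)
    @live = target
    @on_switch.call(@live)
    @live
  end

  # Instant rollback: point traffic back at the previous environment
  def rollback!
    @live = idle
    @on_switch.call(@live)
    @live
  end
end
```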
The sixth concept is all about making your database migrations themselves safe for zero-downtime. Not all migrations are created equal. Some, like adding a new column or creating a new table, are safe. Others, like renaming a column or changing its type, are not. The key is to split dangerous migrations into a series of safe steps that preserve compatibility between your old code and new code.
I always ask: can both the current version of my app and the new version I’m about to deploy work with the database in this intermediate state? If the answer is yes, the migration is safe. This often means writing more complex migrations that use raw SQL for operations like creating indexes concurrently, which doesn’t lock the table.
# A safe, multi-step column rename migration
# Deployment 1: add the new column and backfill it
class AddUsernameToUsers < ActiveRecord::Migration[7.0]
  def up
    add_column :users, :username, :string
    # Copy data from the old column in batches to avoid long locks
    User.in_batches(of: 1000) do |relation|
      relation.update_all('username = login')
    end
  end

  def down
    remove_column :users, :username
  end
end

# Deployment 2, once no running code references :login any more
class RemoveLoginFromUsers < ActiveRecord::Migration[7.0]
  def up
    remove_column :users, :login
  end
end
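The concurrent index creation mentioned above has direct Rails support on PostgreSQL: `CREATE INDEX CONCURRENTLY` cannot run inside a transaction, so the migration opts out of the automatic transaction wrapping. A sketch (the table and column names are examples):

```ruby
# Create an index without locking writes (PostgreSQL only)
class AddIndexToUsersOnEmail < ActiveRecord::Migration[7.0]
  # CONCURRENTLY is incompatible with Rails' transactional migrations
  disable_ddl_transaction!

  def change
    add_index :users, :email, algorithm: :concurrently
  end
end
```

Without `disable_ddl_transaction!`, PostgreSQL would reject the concurrent build with an error, because it cannot run inside the migration's transaction.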
Finally, the seventh pattern is intelligent, automated rollback. Despite our best efforts, things can go wrong. The difference between a minor hiccup and a major outage is often how quickly you can revert. Automated monitoring should watch key signals after a deployment: application error rates, server latency, and even business metrics like sign-up or checkout rates.
I configure alerts so that if error rates jump above a certain point, or if checkout volume suddenly drops, the system doesn’t just page me—it can start an automated rollback procedure. For a blue-green deployment, this means flipping the load balancer back. For a canary, it means setting the traffic percentage to zero. This safety net lets you deploy with much more confidence.
# Monitoring a deployment for auto-rollback
class DeploymentGuard
  MONITOR_WINDOW = 15 * 60 # watch the deployment for 15 minutes

  def monitor(deployment)
    start_time = Time.now
    loop do
      sleep 30
      # The fetch_* helpers wrap your monitoring tool's API
      current_error_rate = fetch_error_rate_since(start_time)
      current_latency = fetch_p99_latency_since(start_time)

      if current_error_rate > 0.05 || current_latency > 2000
        puts "⚠️ Problems detected! Initiating rollback..."
        deployment.rollback!
        break
      end

      # The deployment is considered healthy once the window passes
      break if Time.now - start_time > MONITOR_WINDOW
    end
  end
end
Putting it all together, zero-downtime deployment isn’t a single magic trick. It’s a set of complementary practices. You use safe database migrations to change your data layer. Feature flags give you control over your code releases. Patterns like canary and blue-green deployments manage the risk of launching new versions. Connection draining and automated rollbacks handle the edges and failures gracefully.
Each application is different. A small internal tool might not need a full blue-green setup, while a large e-commerce site might rely on all of these patterns together. The goal is the same: to make deployments a routine, boring event, not a source of stress. By building these practices into your workflow, you can ship code frequently and reliably, keeping your application available to users around the clock.