Building Resilient Rails Applications: Essential Patterns for Handling Failures and High Traffic Gracefully
Build resilient Rails apps that handle failures gracefully. Learn circuit breakers, bulkheads, retries & fallbacks to prevent cascading failures. Keep your app running when services fail.
I want to talk about building Rails applications that don’t break when things go wrong. In my experience, things always go wrong. A payment processor gets slow. A database connection drops. An email service has an outage. The question isn’t if these events will happen, but when. The goal isn’t to prevent every possible failure—that’s impossible. The goal is to build an application that can handle failure gracefully, keep the important parts working, and recover on its own.
This is what we mean by resilience. It’s the difference between a complete site outage and a slightly degraded experience. It’s the ability for your checkout process to work even when the recommendation engine is having a bad day. Let’s look at some practical ways to build this toughness into a Rails application.
Imagine you’re calling an external API to charge a credit card. If that service starts timing out or returning errors, what happens? A naive approach might keep trying, over and over. This is a disaster. Your application threads get stuck waiting. Your database connection pool drains. One failing external service can drag your entire application down. This is called a cascading failure.
We need a way to stop calling a broken service. Think of it like an electrical circuit breaker. When too much current flows, the breaker “trips” to prevent damage. We can do the same in code. Here’s a basic idea of how that looks.
class CircuitOpenError < StandardError; end

class CircuitBreaker
  def initialize(service, failure_threshold: 5, reset_timeout: 60)
    @service = service
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @state = :closed
    @failure_count = 0
  end

  def call
    raise CircuitOpenError, "Service unavailable" if @state == :open

    begin
      result = @service.call
      # Any success, including a half-open trial call, closes the breaker
      @state = :closed
      @failure_count = 0
      result
    rescue
      @failure_count += 1
      # A failed trial call, or too many failures in a row, opens the breaker
      if @state == :half_open || @failure_count >= @failure_threshold
        @state = :open
        schedule_reset
      end
      raise
    end
  end

  private

  # Simplified: a background thread flips the state after the timeout.
  # Production code would guard @state and @failure_count with a Mutex.
  def schedule_reset
    Thread.new do
      sleep @reset_timeout
      @state = :half_open
    end
  end
end

# Using it
gateway = CircuitBreaker.new(-> { ExternalPaymentApi.charge(amount) })
begin
  gateway.call
rescue CircuitOpenError
  # Show a friendly message to the user
  "Our payment system is temporarily busy. Please try again in a minute."
end
The breaker starts in a :closed state, letting calls through. If failures pile up past our threshold, it trips into an :open state. In this state, it immediately rejects calls without even trying the service. This gives the failing system time to recover. After a timeout period, it moves to a :half_open state. We could allow one test request through. If it succeeds, we close the breaker again. If it fails, we open it once more. This simple pattern isolates your application from downstream failures.
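The half-open step described above can be sketched without a background thread by recording when the breaker opened and letting exactly one caller through once the timeout has passed. This is a minimal sketch under assumptions, not a production breaker: the names ProbingBreaker and OpenError and the injectable clock are hypothetical and exist only for illustration.

```ruby
# Hypothetical sketch: a thread-safe breaker that admits a single trial request
# after the reset timeout. Class and error names are illustrative only.
class ProbingBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 5, reset_timeout: 60,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @clock = clock
    @lock = Mutex.new
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def state
    @lock.synchronize { @state }
  end

  def call
    acquire_permission!
    succeeded = false
    begin
      result = yield
      succeeded = true
      result
    ensure
      # Runs on every exit path, so a trial request can never leave the
      # breaker stuck in :half_open
      succeeded ? record_success : record_failure
    end
  end

  private

  def acquire_permission!
    @lock.synchronize do
      case @state
      when :open
        raise OpenError, "circuit open" if @clock.call - @opened_at < @reset_timeout
        @state = :half_open # this caller becomes the single trial request
      when :half_open
        raise OpenError, "trial request already in flight"
      end
    end
  end

  def record_success
    @lock.synchronize do
      @state = :closed
      @failures = 0
    end
  end

  def record_failure
    @lock.synchronize do
      @failures += 1
      if @state == :half_open || @failures >= @failure_threshold
        @state = :open
        @opened_at = @clock.call
      end
    end
  end
end
```

Usage would look like `breaker.call { ExternalPaymentApi.charge(amount) }`, with OpenError handled the same way as the friendly message above. Passing the clock in as a lambda is a design choice that makes the timeout testable without sleeping.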
Now, let’s think about resources. A typical Rails app uses a pool of threads or processes to handle requests. What if one slow service uses up all those threads? Other, perfectly healthy parts of your app can’t work because there are no threads left. It’s like a leak in one cabin sinking the entire ship.
The solution is to build bulkheads. On a ship, a bulkhead is a wall that sections off compartments. A leak floods only one section. We can do the same by separating our application’s resources into isolated pools.
require "concurrent" # concurrent-ruby gem

class BulkheadExecutor
  def initialize(pool_name, size: 10)
    @pool = Concurrent::ThreadPoolExecutor.new(
      name: pool_name,
      min_threads: 1,
      max_threads: size,
      max_queue: 100 # Bounded queue: work beyond this is rejected, not piled up
    )
  end

  def execute(&job)
    Concurrent::Future.execute(executor: @pool, &job)
  end
end

# Create separate pools for different tasks
$payment_executor = BulkheadExecutor.new('payments', size: 5)
$email_executor = BulkheadExecutor.new('emails', size: 3)
$analytics_executor = BulkheadExecutor.new('analytics', size: 2)

def process_order(order)
  # Each service runs in its own isolated pool
  payment_future = $payment_executor.execute { charge_card(order) }
  email_future = $email_executor.execute { send_confirmation(order) }
  analytics_future = $analytics_executor.execute { track_event(order) }

  # Wait for the critical one (payment). value! blocks until the future
  # finishes and re-raises its error; plain value would return nil on failure.
  payment_future.value!
end
Here, the payment service has a pool of 5 threads. Even if it gets very slow, it can only use those 5. The email pool has 3. They are isolated. A flood of email jobs won’t stop payments from being processed. You can tune the size of each pool based on the importance and characteristics of the task. This containment is incredibly powerful for stability.
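To see the isolation in a form that runs with nothing but the standard library, here is a minimal sketch of a bulkhead built on SizedQueue. It is an illustration under assumptions rather than a replacement for a real thread pool: SimpleBulkhead and FullError are hypothetical names, and error handling is reduced to a warning.

```ruby
# Hypothetical sketch: a fixed set of worker threads fed by a bounded queue.
# When the queue is full, new work is rejected instead of waiting.
class SimpleBulkhead
  class FullError < StandardError; end

  def initialize(workers:, queue_size:)
    @queue = SizedQueue.new(queue_size)
    @threads = Array.new(workers) do
      Thread.new do
        # pop returns nil once the queue is closed and drained
        while (job = @queue.pop)
          begin
            job.call
          rescue StandardError => e
            warn "bulkhead job failed: #{e.message}"
          end
        end
      end
    end
  end

  def submit(&job)
    @queue.push(job, true) # non-blocking push raises ThreadError when full
  rescue ThreadError
    raise FullError, "bulkhead is full"
  end

  def shutdown
    @queue.close
    @threads.each(&:join)
  end
end
```

With one instance per dependency, a stalled payment provider fills only its own queue and starts raising FullError there, while the email and analytics instances keep accepting work.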
Sometimes, failure is temporary. A network glitch, a brief timeout, a momentary overload. In these cases, we want to try again. But we must do it smartly. Blasting a struggling service with immediate retries can make the problem worse.
We need a retry strategy with patience. This usually means waiting a bit between tries, and increasing that wait each time. This is called exponential backoff. Adding a little random variation, called jitter, helps prevent many clients from retrying at the exact same moment.
require "net/http" # defines Net::OpenTimeout and Net::ReadTimeout

class RetryWithBackoff
  RETRYABLE_ERRORS = [Net::OpenTimeout, Net::ReadTimeout, SocketError].freeze

  def self.attempt(max_attempts: 3, &block)
    attempts = 0
    begin
      attempts += 1
      block.call
    rescue => e
      # Only retry on certain errors
      if attempts < max_attempts && retryable?(e)
        sleep_time = backoff_duration(attempts)
        Rails.logger.info "Retry attempt #{attempts} after #{sleep_time}s for: #{e.message}"
        sleep(sleep_time)
        retry
      else
        raise # Re-raise if we're done retrying or error is not retryable
      end
    end
  end

  def self.retryable?(error)
    RETRYABLE_ERRORS.any? { |klass| error.is_a?(klass) }
  end

  def self.backoff_duration(attempt)
    # Exponential backoff: 1s, 2s, 4s, etc.
    base_delay = 1
    max_delay = 10
    delay = base_delay * (2 ** (attempt - 1))
    # Add jitter: up to 10% of the delay
    jitter = rand(0.0..0.1) * delay
    final_delay = delay + jitter
    [final_delay, max_delay].min
  end
end

# Using it for a database query. Database errors are not in RETRYABLE_ERRORS
# above; add the ones you consider transient (for example
# ActiveRecord::ConnectionNotEstablished) or the query will never be retried.
def fetch_user_data
  RetryWithBackoff.attempt(max_attempts: 4) do
    # connection.execute does not bind "?" placeholders, so sanitize the SQL first
    sql = ActiveRecord::Base.sanitize_sql_array(["SELECT * FROM users WHERE id = ?", user_id])
    Database::Replica.connection.select_all(sql)
  end
end
This code does a few key things. It classifies errors, so we don’t retry a “card declined” error, only things like timeouts. It waits longer each time. The backoff_duration method calculates a wait that doubles each attempt (1 second, then 2, then 4), caps it at a maximum, and stirs in a little randomness. This gives the remote system a real chance to recover.
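To make the arithmetic concrete, the same calculation can be written as a pure function that returns the whole schedule of waits. This is a sketch under assumptions: backoff_schedule and its parameters are hypothetical names, and the random source is passed in so the result can be reproduced.

```ruby
# Hypothetical helper: returns the wait (in seconds) before each retry.
# base doubles per attempt, jitter adds up to `jitter` times the delay,
# and cap bounds the final value.
def backoff_schedule(attempts:, base: 1.0, cap: 10.0, jitter: 0.1, rng: Random.new)
  (1..attempts).map do |attempt|
    delay = base * (2**(attempt - 1))
    delay += rng.rand * jitter * delay # rng.rand is a float in [0, 1)
    [delay, cap].min
  end
end

# With jitter switched off the doubling and the cap are easy to see
p backoff_schedule(attempts: 5, jitter: 0.0)
```

With jitter disabled, five attempts wait 1, 2, 4, 8 and then 10 seconds, because the fifth delay of 16 seconds is cut down to the cap. With the default jitter each value is stretched by at most 10% before the cap is applied.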
What if, despite our circuit breakers and retries, a service is just down? The circuit is open, and we can’t reach it. Do we just show an error page? Not necessarily. We can design our features to have a backup plan, a simpler way to get the job done. This is called a fallback.
The key to a good fallback is that it’s a different, more reliable path, even if it’s less optimal. It’s about graceful degradation.
class ProductRecommendations
  def for_user(user)
    # 1. Primary: AI-powered, real-time recommendations
    fetch_ai_recommendations(user)
  rescue CircuitOpenError, ServiceUnavailable
    # 2. Fallback: User-based similarity from our own DB
    Rails.logger.warn "AI service down, using collaborative filter fallback."
    local_recommendations(user)
  rescue => e
    # 3. Fallback: Generic popular items
    Rails.logger.error "Recommendation lookup failed: #{e.message}"
    popular_products.limit(10)
  end

  private

  def fetch_ai_recommendations(user)
    # Calls external, possibly flaky service
    raise CircuitOpenError if $recommendation_circuit.open?
    ExternalAI.query("recommend_for", user.id)
  end

  # A rescue clause does not cover errors raised inside a sibling rescue body,
  # so the second-level fallback guards itself and handles an empty result here.
  def local_recommendations(user)
    recommendations = collaborative_filter(user)
    return recommendations if recommendations.any?
    popular_products.limit(10)
  rescue => e
    Rails.logger.error "All recommendation strategies failed: #{e.message}"
    popular_products.limit(10)
  end

  def collaborative_filter(user)
    # Uses our own database and logs
    user.viewed_products.joins(:tags).order('view_count DESC').limit(10)
  end

  def popular_products
    Product.where('sales_count > 100').order('sales_count DESC')
  end
end
The user might not get the perfectly personalized list, but they get something sensible. The site remains functional. Fallbacks can be chained: try the best method, if that fails try a good method, if that fails show something basic. This maintains the user experience even when parts of your architecture are unavailable.
To manage all these patterns, we need to know what’s working. We need health checks. But a simple “Is the database up?” check isn’t enough. We need to understand dependencies. If the Redis cache is down, is the “recommendations” service healthy? Probably not, because it depends on Redis.
We can build a health check system that understands these relationships.
class HealthCheck
  CHECKS = {
    database: -> { ActiveRecord::Base.connection.active? },
    # Redis.current was removed in redis-rb 5; use your app's own client there
    redis: -> { Redis.current.ping == "PONG" },
    search_index: -> { SearchClient.ping },
    payment_gateway: -> { PaymentGateway.health_check }
  }.freeze

  DEPENDENCIES = {
    cache: [:redis],
    recommendations: [:redis, :database],
    checkout: [:database, :payment_gateway]
  }.freeze

  def self.overall_status
    results = {}
    status = :ok

    CHECKS.each do |name, check|
      begin
        results[name] = { status: check.call ? :healthy : :unhealthy }
      rescue => e
        results[name] = { status: :error, message: e.message }
      end
      # A check counts against the overall status whether it raised or returned false
      status = :service_unavailable unless results[name][:status] == :healthy
    end

    # Evaluate composite services based on dependencies
    DEPENDENCIES.each do |service, deps|
      failed = deps.reject { |dep| results[dep][:status] == :healthy }
      if failed.empty?
        results[service] = { status: :healthy }
      else
        # Report only the dependencies that actually failed
        results[service] = { status: :unhealthy, reason: "Failed dependencies: #{failed.join(', ')}" }
        status = :service_unavailable
      end
    end

    { overall: status, details: results }
  end
end

# In a controller
class HealthController < ApplicationController
  def show
    status = HealthCheck.overall_status
    http_status = status[:overall] == :ok ? :ok : :service_unavailable
    render json: status, status: http_status
  end
end
This gives a load balancer or monitoring system a clear signal. If the /health endpoint returns a 503, it knows not to send traffic there. The detailed JSON output tells an engineer exactly which dependency is broken. This is crucial for quick diagnosis and repair.
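The dependency roll-up is easier to trust when it can be exercised without real services. The following sketch, under the assumption that checks are passed in as lambdas, shows the same idea as a plain function. The name evaluate_health and the failed_dependencies key are hypothetical.

```ruby
# Hypothetical helper: runs each check, then derives the status of composite
# services from the checks they depend on. An unknown dependency counts as failed.
def evaluate_health(checks, dependencies)
  results = checks.to_h do |name, check|
    outcome = begin
      check.call ? { status: :healthy } : { status: :unhealthy }
    rescue StandardError => e
      { status: :error, message: e.message }
    end
    [name, outcome]
  end

  dependencies.each do |service, deps|
    failed = deps.reject { |dep| results.dig(dep, :status) == :healthy }
    results[service] =
      failed.empty? ? { status: :healthy } : { status: :unhealthy, failed_dependencies: failed }
  end

  overall = results.values.all? { |r| r[:status] == :healthy } ? :ok : :service_unavailable
  { overall: overall, details: results }
end

report = evaluate_health(
  { database: -> { true }, redis: -> { raise "connection refused" } },
  { cache: [:redis], accounts: [:database] }
)
```

In this example the redis check raises, so cache is reported unhealthy with redis listed as the failed dependency, accounts stays healthy, and the overall status is :service_unavailable.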
Time is a resource too. A user request should not hang forever. We need timeouts. But it’s not just one global timeout. A complex operation might have several steps. We should allocate a total budget and divide it among the steps, ensuring we never exceed the total. This is deadline propagation.
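class DeadlineExceededError < StandardError; end

class Deadline
  def initialize(max_duration_seconds)
    @start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @max_duration = max_duration_seconds
  end

  def time_remaining
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @start
    [@max_duration - elapsed, 0].max
  end

  def expired?
    time_remaining <= 0
  end

  def check!
    raise DeadlineExceededError if expired?
  end

  # Runs one step with its own budget, capped by whatever is left overall.
  # The block receives the step's Deadline so it can derive client timeouts from it.
  def step(budget)
    check!
    step_deadline = Deadline.new([budget, time_remaining].min)
    result = yield step_deadline
    step_deadline.check! # Raises if the step overran its share
    result
  end
end

def process_user_request
  # Total budget for this request: 2 seconds
  total_deadline = Deadline.new(2.0)

  # Allocate time to sub-tasks
  auth_budget = 0.3
  db_budget = 0.8
  api_budget = 0.7

  # Run authentication with its budget
  total_deadline.step(auth_budget) { authenticate_user! }

  # Run database work with its budget
  total_deadline.step(db_budget) { load_user_data }

  # step checks the total budget is still okay before calling a slow API
  total_deadline.step(api_budget) { call_external_api }
end
The Deadline object tracks how much time has passed since it was created, and check! raises an error once the time is up. Note that check! only inspects the clock; it cannot interrupt code that is already running, and a block handed to a method that never yields is silently ignored. The step method therefore checks before a step starts, caps the step’s budget at whatever remains of the total, runs the block, and checks again when the step finishes. To actually cut a slow call short, pass the remaining time to the client’s own timeout setting. This way a single slow step cannot quietly consume the whole budget, and the application stays responsive.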
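A deadline check can only look at the clock, so the remaining budget is most useful when it is handed to the client that does the waiting. Here is a minimal sketch of that hand-off for Net::HTTP. The names RequestDeadline, DeadlineExceeded and http_client_for are hypothetical, as is the 0.5 second ceiling on connection setup, and the clock is injectable only to make the example reproducible.

```ruby
require "net/http"
require "uri"

class DeadlineExceeded < StandardError; end

# Hypothetical minimal deadline with an injectable clock
class RequestDeadline
  def initialize(seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @expires_at = clock.call + seconds
  end

  def time_remaining
    [@expires_at - @clock.call, 0].max
  end
end

# Builds an HTTP client whose timeouts are derived from the remaining budget.
# No connection is opened here; that happens when a request is made.
def http_client_for(uri, deadline)
  remaining = deadline.time_remaining
  raise DeadlineExceeded, "no time left for #{uri.host}" if remaining <= 0

  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = [remaining, 0.5].min # assumed ceiling for connecting
  http.read_timeout = remaining            # applies per read, so this is an upper-bound approximation
  http
end
```

Because read_timeout applies to each individual read rather than to the whole response, a slow trickle of data can still exceed the budget. Checking the deadline again after the call returns closes that gap.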
Finally, there are times when the problem isn’t a failing service, but too much success. Traffic spikes can overwhelm your systems. When you’re at capacity, accepting more work makes everything slower and can cause a full crash. It’s better to say “no” quickly to some requests, so you can properly serve others. This is load shedding.
You need a way to know your current load and reject new requests early if you’re at capacity. You can also prioritize.
require "concurrent" # concurrent-ruby gem

class RequestAdmitter
  def initialize(max_concurrent: 100, reserved_for_critical: 10)
    # Two pools: one open to everyone, one held back for critical requests
    @general = Concurrent::Semaphore.new(max_concurrent - reserved_for_critical)
    @reserved = Concurrent::Semaphore.new(reserved_for_critical)
  end

  # Returns the pool the slot was taken from, or nil if the request should be rejected.
  # Every admitted request holds exactly one permit, so release stays balanced.
  def admit(priority = :normal)
    return @general if @general.try_acquire
    # Critical requests may dip into the reserved slots
    return @reserved if priority == :critical && @reserved.try_acquire
    nil
  end

  def release(pool)
    pool.release
  end
end

# In Rack middleware
class AdmissionMiddleware
  def initialize(app)
    @app = app
    @admitter = RequestAdmitter.new(max_concurrent: 150)
  end

  def call(env)
    # Determine priority from request path or header
    priority = env['PATH_INFO'].start_with?('/api/checkout') ? :critical : :normal
    slot = @admitter.admit(priority)

    if slot
      begin
        @app.call(env)
      ensure
        @admitter.release(slot)
      end
    else
      # Too busy - reject fast with a 503 (lowercase header names also satisfy Rack 3)
      [503, { 'content-type' => 'text/plain' }, ['Service overloaded']]
    end
  end
end
This semaphore-based approach limits how many requests are being processed at once. A share of the slots is reserved for requests marked as :critical (like a checkout), so they can still get in when the general pool is exhausted. Other requests are rejected with a clean 503. This is much better than letting them all in, exhausting database connections, and causing a timeout avalanche that fails everyone.
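One way to give critical requests real priority is to reserve part of the capacity for them. The sketch below shows that idea with a single counter and the standard library only. It is an illustration under assumptions: PriorityAdmitter, try_admit and the reserved_for_critical parameter are hypothetical names.

```ruby
# Hypothetical sketch: normal requests may use capacity up to a lower limit,
# critical requests may use all of it. Every admitted request must call release.
class PriorityAdmitter
  def initialize(max_concurrent:, reserved_for_critical:)
    @max = max_concurrent
    @normal_limit = max_concurrent - reserved_for_critical
    @in_flight = 0
    @lock = Mutex.new
  end

  def try_admit(priority = :normal)
    limit = priority == :critical ? @max : @normal_limit
    @lock.synchronize do
      return false if @in_flight >= limit
      @in_flight += 1
      true
    end
  end

  def release
    @lock.synchronize { @in_flight -= 1 if @in_flight > 0 }
  end

  def in_flight
    @lock.synchronize { @in_flight }
  end
end
```

The check and the increment happen under the same lock, so two requests cannot both take the last slot. The size of the reserve is a tuning decision: too small and checkouts are shed under load, too large and ordinary traffic is rejected while capacity sits idle.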
Putting it all together, resilience is a mindset. It’s about assuming failure will happen and designing your code to expect it. You start with timeouts and retries for transient issues. You use circuit breakers to protect from persistent external failures. You isolate components with bulkheads so one failure can’t spread. You define fallbacks to maintain a basic user experience. You monitor health with awareness of dependencies. You enforce deadlines to stay responsive. And you shed load to protect your core under pressure.
None of this is particularly glamorous, but it’s what separates a hobby project from a robust production system. The code examples I’ve shown are simplified starting points. In a real application, you’d use and configure battle-tested gems like semian for circuit breaking and bulkheading, or sidekiq with its built-in retry mechanisms. The principles, however, remain the same. By weaving these patterns into your Rails application, you build something that can withstand the storms of the real internet and keep serving your users.