Distributed systems demand resilience. When one service fails, others shouldn’t cascade into collapse. I’ve seen this firsthand during major outages where a single database timeout rippled through payment processing and notification services. Circuit breakers prevent this by isolating failing components. Let’s examine six practical Ruby techniques to build robust circuit breakers.
Failure thresholds define when to trip the breaker. Setting them requires balancing sensitivity against stability: too low a threshold trips on transient noise, while too high a threshold lets failures linger before the breaker reacts. Here’s how I configure thresholds dynamically based on traffic volume:
class AdaptiveThresholdBreaker
  def initialize(service)
    @service = service
    @min_threshold = 3    # floor used while traffic is light
    @base_threshold = 10  # baseline failure count once traffic picks up
    @request_count = 0
  end

  def call
    @request_count += 1
    calculate_threshold
    # ... execute the service and compare accumulated failures
    # against the calculated threshold
  end

  private

  # Scale the trip threshold with traffic: low-volume services trip after a
  # handful of failures, while busy services tolerate proportionally more.
  def calculate_threshold
    if @request_count < 100
      @min_threshold
    else
      [@base_threshold, (@request_count * 0.1).to_i].max
    end
  end
end
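To make the threshold actionable, the breaker also needs to count failures and refuse calls once they reach the adaptive limit. Here’s a minimal sketch of that elided piece; SimpleAdaptiveBreaker, CircuitOpenError, and @failure_count are names I’m introducing for illustration, not part of the original class:

# Sketch only: wires the adaptive threshold into an actual trip decision.
class SimpleAdaptiveBreaker < AdaptiveThresholdBreaker
  CircuitOpenError = Class.new(StandardError)

  def call
    @request_count += 1
    # Refuse the call once accumulated failures reach the adaptive threshold.
    raise CircuitOpenError, 'breaker is open' if @failure_count.to_i >= calculate_threshold

    @service.call
  rescue CircuitOpenError
    raise
  rescue StandardError
    @failure_count = @failure_count.to_i + 1
    raise
  end
end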
State transitions form the breaker’s core logic. The classic states are closed, open, and half-open. I implement the breaker as a finite state machine with explicit transition rules, using the finite_machine gem:
require 'finite_machine'

class StateMachineBreaker
  def initialize(service)
    @service = service
    @fsm = FiniteMachine.define do
      initial :closed

      event :trip,    :closed    => :open       # failures crossed the threshold
      event :reset,   :open      => :half_open  # cool-off period elapsed
      event :confirm, :half_open => :closed     # probe request succeeded
      event :retry,   :half_open => :open       # probe request failed
    end
  end

  def call
    case @fsm.current
    when :closed    then execute_service
    when :open      then handle_open_state
    when :half_open then attempt_reset
    end
  end

  # ... state-specific methods
end
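The state-specific methods decide when each event fires. Here is a rough fill-in for them, assuming a fixed 30-second cool-off window; handle_open_state, attempt_reset, and @opened_at are illustrative details rather than the article’s exact implementation:

# Sketch of the elided state-specific methods; timings and error handling
# here are assumptions, not part of the original example.
class StateMachineBreaker
  OPEN_COOL_OFF = 30 # seconds before an open breaker allows a probe

  private

  def execute_service
    @service.call
  rescue StandardError
    @opened_at = Time.now
    @fsm.trip            # closed -> open
    raise
  end

  def handle_open_state
    raise 'circuit open' if Time.now - @opened_at < OPEN_COOL_OFF

    @fsm.reset           # open -> half_open after the cool-off
    attempt_reset
  end

  def attempt_reset
    result = @service.call
    @fsm.confirm         # probe succeeded: half_open -> closed
    result
  rescue StandardError
    @opened_at = Time.now
    @fsm.retry           # probe failed: half_open -> open
    raise
  end
end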
Fallback operations maintain functionality during failures. I prefer context-aware fallbacks over static responses. For an order processing service, this might mean using cached inventory data:
class OrderService
  def fallback(request)
    {
      status: :degraded,
      inventory: Rails.cache.fetch('inventory_snapshot', expires_in: 1.hour) { legacy_stock_check },
      message: "Using cached inventory data"
    }
  end

  def legacy_stock_check
    # ... fetch from secondary source
  end
end
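Hooking the fallback into a breaker is then a matter of returning it whenever the circuit is open or the call fails. A small sketch under those assumptions; FallbackBreaker and its failure threshold are hypothetical names, not an existing API:

# Sketch: degrade to the service's context-aware fallback instead of raising.
class FallbackBreaker
  def initialize(service, threshold: 5)
    @service = service
    @threshold = threshold
    @failure_count = 0
  end

  def call(request)
    return @service.fallback(request) if open?

    @service.call(request)
  rescue StandardError
    @failure_count += 1
    @service.fallback(request)
  end

  private

  def open?
    @failure_count >= @threshold
  end
end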
Graceful degradation preserves core features when dependencies fail. In an e-commerce system, I prioritize checkout over recommendations. This tiered approach maintains revenue-critical paths:
class FeatureFlags
  def self.essential?(feature)
    case feature
    when :checkout, :cart then true
    when :recommendations, :reviews then false
    else false # unknown features default to non-essential
    end
  end
end

class CircuitBreaker
  def call(operation)
    if FeatureFlags.essential?(operation)
      execute_essential(operation)
    else
      execute_non_essential(operation)
    end
  end
end
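The two execution paths are left abstract above. One plausible split, which I’m sketching here as an assumption rather than the article’s implementation, gives essential operations a retry and a longer timeout while non-essential operations fail fast to nothing, so a dead recommendations service never blocks a checkout:

require 'timeout'

# Illustrative tiered execution; timeouts, retry counts, and the `run`
# dispatcher are assumptions for the sketch.
class TieredCircuitBreaker < CircuitBreaker
  private

  def execute_essential(operation)
    attempts = 0
    begin
      Timeout.timeout(5) { run(operation) }  # generous timeout, one retry
    rescue StandardError
      attempts += 1
      retry if attempts < 2
      raise
    end
  end

  def execute_non_essential(operation)
    Timeout.timeout(1) { run(operation) }    # fail fast, degrade silently
  rescue StandardError
    nil
  end

  def run(operation)
    # ... dispatch to the actual service call for `operation`
  end
end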
Health monitoring integration provides real-time insight into breaker behavior. I combine StatsD metrics with structured logging to track breaker activity:
require 'benchmark'

class InstrumentedBreaker < CircuitBreaker
  # Assumes the parent breaker defines execute_service/fallback and tracks @state.
  def execute_service
    result = nil
    duration = Benchmark.realtime { result = super }
    StatsD.distribution('breaker.latency', duration)
    LogStructuredData.emit(event: :service_call, state: @state)
    result
  end

  def fallback
    StatsD.increment('breaker.fallback')
    super
  end
end
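Per-call latency only tells half the story; the breaker’s state transitions are what operators usually page on. A small addition, assuming a notify_state_change hook of my own naming rather than an existing callback:

# Sketch: surface state transitions as a counter plus a structured log event.
# `notify_state_change` is hypothetical; wire it to whatever triggers transitions.
class InstrumentedBreaker < CircuitBreaker
  def notify_state_change(from, to)
    StatsD.increment("breaker.transition.#{from}_to_#{to}")
    LogStructuredData.emit(event: :state_change, from: from, to: to, at: Time.now.utc)
  end
end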
Dynamic timeout adjustment responds to network conditions. During peak hours, I automatically extend timeouts while maintaining safeguards:
require 'timeout'

class AdaptiveTimeoutBreaker
  def initialize(service)
    @service = service
    @base_timeout = 2.0   # seconds
    @timeout_factor = 1.0
  end

  def call
    adjust_timeout_based_on_health
    Timeout.timeout(calculated_timeout) { @service.call }
  end

  private

  def calculated_timeout
    @base_timeout * @timeout_factor
  end

  # Loosen timeouts as downstream health degrades so slow-but-working
  # dependencies aren't cut off prematurely.
  def adjust_timeout_based_on_health
    health_score = HealthMonitor.current_score
    @timeout_factor = case health_score
                      when 0..60  then 1.8 # degraded performance
                      when 61..80 then 1.3
                      else 1.0
                      end
  end
end
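HealthMonitor isn’t defined in the example. A minimal version could derive a 0–100 score from a rolling window of recent call durations; everything below, including the 200ms target and 50-sample window, is an assumed implementation for illustration:

# Hypothetical HealthMonitor: scores recent latency against a target.
class HealthMonitor
  TARGET_LATENCY = 0.2 # seconds considered healthy
  WINDOW_SIZE = 50     # number of recent samples to keep

  @samples = []

  class << self
    def record(duration)
      @samples << duration
      @samples.shift while @samples.size > WINDOW_SIZE
    end

    # 100 when average latency is at or under the target, falling toward 0
    # as it approaches five times the target.
    def current_score
      return 100 if @samples.empty?

      avg = @samples.sum / @samples.size
      (100 - ((avg - TARGET_LATENCY) / (TARGET_LATENCY * 4) * 100)).clamp(0, 100).round
    end
  end
end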
These patterns form a toolkit for resilient Ruby systems. Start with basic failure thresholds, then layer in state management and fallbacks. Add monitoring before implementing advanced features like dynamic timeouts. Through gradual refinement, you’ll create systems that fail gracefully and recover intelligently. Remember to test breaker behavior under simulated failure conditions; I’ve caught critical flaws by injecting network partitions during CI/CD runs. Resilience isn’t an afterthought; it’s the foundation of trustworthy distributed systems.
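As a closing illustration, here’s a rough RSpec sketch of that kind of failure-injection test, exercising the FallbackBreaker sketch from earlier; the forced Timeout::Error and five-call threshold are assumptions for the example, not a prescription:

require 'rspec'
require 'timeout'

# Sketch: force the dependency to fail and assert the breaker stops calling it.
RSpec.describe 'circuit breaker under injected failures' do
  let(:failing_service) { double('service') }
  let(:breaker) { FallbackBreaker.new(failing_service, threshold: 5) }

  it 'stops hitting the dependency once the threshold is crossed' do
    allow(failing_service).to receive(:call).and_raise(Timeout::Error)
    allow(failing_service).to receive(:fallback).and_return(status: :degraded)

    5.times { breaker.call({}) } # drive the breaker past its threshold

    expect(breaker.call({})[:status]).to eq(:degraded)
    expect(failing_service).to have_received(:call).exactly(5).times
  end
end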