Building resilient background job systems in Ruby on Rails requires deliberate design choices. When payment processing fails at 3 AM, or data synchronization stalls during peak traffic, robust patterns prevent catastrophic failures. I’ve learned through production fires that these seven techniques form the backbone of reliable asynchronous processing.
Idempotent job design ensures duplicate executions don’t corrupt data. Consider this email notification job:
class NotificationDeliveryJob
include Sidekiq::Worker
sidekiq_options unique: :until_executed
def perform(user_id, campaign_id)
user = User.find(user_id)
campaign = Campaign.find(campaign_id)
return if user.notifications.where(campaign: campaign).exists?
NotificationService.deliver(user, campaign)
user.notifications.create!(campaign: campaign, sent_at: Time.current)
end
end
The uniqueness lock prevents queue duplicates, while the existence check guards against database-level duplicates. I once saw a marketing campaign send 12,000 duplicate emails without these safeguards.
Exponential backoff with random jitter prevents retry avalanches during outages. Configure it directly in your worker:
sidekiq_options retry: 7, backoff_jitter: 0.15
def perform
# ... logic
rescue NetworkError => e
logger.warn "Retrying after #{retry_count**2 + rand(30)} seconds"
raise e
end
The jitter introduces randomness to spread retries evenly. During a major API outage last year, this prevented our systems from hammering failing endpoints simultaneously.
Dependency chaining manages complex workflows. The JobDependencyManager I built coordinates multi-step processes:
manager = JobDependencyManager.new
# Process payment only after fraud check completes
manager.enqueue(FraudCheckJob, order_id)
manager.enqueue(PaymentCaptureJob, order_id, dependencies: [fraud_job_id])
# Fulfillment only after payment and inventory check
manager.enqueue(InventoryReservationJob, order_id)
manager.enqueue(FulfillmentJob, order_id, dependencies: [payment_job_id, inventory_job_id])
This pattern helped reduce our order processing errors by 68% by eliminating race conditions between steps.
Dead letter queues capture failed jobs for analysis. With Sidekiq Enterprise:
sidekiq_options dead: true
Sidekiq.configure_server do |config|
config.dead_job_handlers << ->(job, ex) do
ErrorTracker.record(
exception: ex,
job_params: job['args'],
worker: job['class']
)
end
end
We pipe these to our error dashboard, where I’ve diagnosed everything from SSL expiry to currency conversion edge cases.
Priority queues ensure critical tasks proceed during congestion. Define queue weights:
# config/sidekiq.yml
:queues:
- critical
- default
- low_priority
# Worker declaration
class PaymentProcessingJob
include Sidekiq::Worker
sidekiq_options queue: :critical
end
During our Black Friday sale, payment jobs skipped ahead of 80,000 analytics jobs without delays.
Resource cleanup prevents memory bloat in long-running jobs. Always wrap external connections:
class DataExportJob
def perform
ActiveRecord::Base.connection_pool.with_connection do
# Database operations
end
Redis.current.with do |conn|
# Redis operations
end
ensure
GC.start
clear_temp_files
end
private
def clear_temp_files
Dir.glob("/tmp/export-*.csv").each { |f| File.delete(f) }
end
end
I once debugged a 48GB memory leak caused by unclosed file handles in CSV exports - this pattern fixed it.
State machines track job lifecycle transitions:
class JobState < ApplicationRecord
include AASM
aasm do
state :pending, initial: true
state :processing, :succeeded, :failed
event :process do
transitions from: :pending, to: :processing
end
event :complete do
transitions from: :processing, to: :succeeded, guard: :output_present?
end
event :fail do
transitions from: [:pending, :processing], to: :failed
end
end
end
# In worker
def perform(job_state_id)
js = JobState.find(job_state_id)
js.process!
# ... execute work
js.complete!
rescue => e
js.fail!
end
Our dashboard visually tracks jobs through these states, showing bottlenecks in real-time.
These patterns compose into a robust system. Payment jobs use idempotency keys and exponential backoff. Fulfillment workflows chain dependencies with priority handling. Export jobs implement resource cleanup and state tracking. Together, they maintain throughput during partial failures - whether it’s third-party API degradation or database replica lag. Start with one pattern that addresses your most frequent failure mode, then progressively layer others. Resilient systems aren’t built overnight, but through deliberate iteration on real-world failures.