Build Bulletproof Ruby Background Jobs: Patterns for Handling Real Production Failures
Learn 7 essential Ruby background job patterns for idempotent processing, job chaining, batch handling & error recovery. Build reliable production systems.
When you’re building Ruby applications that real people depend on, the work doesn’t always finish when a web request ends. Sending emails, processing videos, analyzing data—these tasks can take seconds, minutes, or even hours. If you make a user wait for that, they’ll leave. That’s where background jobs come in. They’re the silent engines that handle the heavy lifting out of sight.
But moving work to the background is just the first step. In a live production environment, things get messy. Jobs fail, servers restart, databases get slow, and third-party services go offline. Over the years, I’ve learned that basic job queuing isn’t enough. You need robust patterns to handle these realities. Let’s walk through some of the most effective strategies I use to keep background processing reliable.
First, let’s talk about making jobs safe to run more than once. This is called idempotency. Imagine you have a job that charges a customer’s credit card. What happens if that job gets queued twice by mistake, or retries after a timeout? You don’t want to charge someone twice. An idempotent job ensures that even if you run it ten times, the final outcome is the same as running it once.
Here’s how I often structure this. The key is to track state and use locks to prevent two workers from acting on the same resource at the same time.
require 'securerandom'

class ProcessOrderJob
  include Sidekiq::Job
  sidekiq_options retry: 5

  LOCK_TTL = 300 # seconds; should exceed the longest expected run

  def perform(order_id, attempt_key = nil)
    attempt_key ||= "#{order_id}-#{SecureRandom.hex(8)}"
    redis_key = "job:process_order:#{order_id}:attempt"

    # SET with NX and EX claims the lock and sets its expiry in one atomic
    # command, so two workers cannot both pass a separate check-then-set.
    acquired = Sidekiq.redis { |r| r.set(redis_key, attempt_key, nx: true, ex: LOCK_TTL) }
    unless acquired
      logger.info "Order #{order_id} is already being processed."
      return
    end

    begin
      order = Order.find(order_id)
      process_order_safely(order)
    ensure
      # Release the lock only if this attempt still owns it.
      Sidekiq.redis do |r|
        r.del(redis_key) if r.get(redis_key) == attempt_key
      end
    end
  end

  private

  def process_order_safely(order)
    # with_lock opens a transaction and takes a row lock on the order.
    order.with_lock do
      # A guard clause with `return` would exit the transaction block early,
      # so the already-processed check is written as a conditional instead.
      unless order.processed_at.present?
        charge_payment(order)
        update_inventory(order)
        send_notifications(order)
        order.update!(processed_at: Time.current)
      end
    end
  end
end
This pattern uses Redis to set a short-lived lock that expires after five minutes, so a crashed worker cannot hold it forever. The first job to arrive claims the lock, and any other job that comes along sees the lock and quietly exits. Inside, the row lock and the processed_at check ensure the core logic only executes if the order hasn’t already been processed. It’s a belt-and-suspenders approach that has saved me from many duplicate charges.
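The core guarantee can be shown without Sidekiq or Redis. The following is a minimal sketch, not the production code: PaymentLedger is a hypothetical name, and its in-memory hash stands in for a database table with a unique index on an idempotency key.

```ruby
# In-memory stand-in for a table with a unique index on idempotency_key.
# PaymentLedger is a hypothetical class used only for this illustration.
class PaymentLedger
  attr_reader :charges

  def initialize
    @charges = {}        # idempotency_key => amount in cents
    @mutex = Mutex.new   # plays the role of the row lock
  end

  # Records the charge once per key; repeat calls return the first result.
  def charge(idempotency_key, amount_cents)
    @mutex.synchronize do
      @charges[idempotency_key] ||= amount_cents
    end
  end

  def total_charged
    @charges.values.sum
  end
end

ledger = PaymentLedger.new
10.times { ledger.charge("order-42", 1999) }
puts ledger.total_charged # prints 1999
```

Ten calls with the same key leave exactly one charge on the ledger, which is the property the job above relies on.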
Sometimes, work isn’t a single task but a sequence. You must validate a user, then charge them, then ship their order, then send a confirmation. These steps depend on each other. Running them in the wrong order, or before a prerequisite is finished, causes chaos. For this, I use job chaining.
The idea is to make jobs aware of their dependencies. A job shouldn’t start its real work until the job before it has finished successfully.
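class PaymentJob
  include Sidekiq::Job

  def perform(order_id, options = {})
    # Sidekiq serializes arguments as JSON, so hash keys arrive as strings.
    depends_on = options['depends_on']

    if depends_on && !dependency_finished?(depends_on)
      self.class.perform_in(5, order_id, options)
      return
    end

    process_payment(order_id)
  end

  private

  # Assumption: the upstream job writes this marker key when it succeeds.
  # The key name is illustrative; Sidekiq's job entries have no completed? method.
  def dependency_finished?(jid)
    !Sidekiq.redis { |r| r.get("job:done:#{jid}") }.nil?
  end
end
In this snippet, the PaymentJob receives the ID of the job it depends on and checks whether that job has recorded its completion. Open-source Sidekiq removes a job once it finishes and keeps no queryable record of it, so the upstream job is assumed to write a marker key when it succeeds. If the marker is missing, the payment job reschedules itself for a few seconds later. This polling is simple but effective. For more complex workflows, I’ll store the state of the entire pipeline in Redis to monitor progress and handle failures at each stage.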
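To make the idea of pipeline state concrete, here is a small sketch of a tracker that only lets a step run once every earlier step has completed. PipelineState and its step names are hypothetical, and the list lives in memory purely for illustration; in a real system it would sit in Redis or a database row so every worker can see it.

```ruby
# Hypothetical in-memory pipeline tracker. In production the completed list
# would be stored somewhere shared, such as Redis.
class PipelineState
  STEPS = %w[validate charge ship confirm].freeze

  def initialize
    @completed = []
  end

  # A step may run only when every step before it has completed.
  def ready?(step)
    index = STEPS.index(step)
    raise ArgumentError, "unknown step: #{step}" if index.nil?

    STEPS.first(index).all? { |s| @completed.include?(s) }
  end

  # Returns :done if the step ran, :retry_later if a prerequisite is missing.
  def run(step)
    return :retry_later unless ready?(step)

    yield if block_given?
    @completed << step unless @completed.include?(step)
    :done
  end
end

pipeline = PipelineState.new
puts pipeline.run("charge")   # prints retry_later (validate has not run yet)
puts pipeline.run("validate") # prints done
puts pipeline.run("charge")   # prints done
```

A job that gets :retry_later back would reschedule itself, just as the PaymentJob does.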
Now, what about processing 100,000 records? You can’t throw that into one giant job. It will be slow, and if it fails, you lose all progress. The solution is batch processing. Break the big list into small chunks and process each chunk as its own job. The trick is tracking the overall progress so you know how things are going.
I create a batch coordinator that sets up a tracking record in Redis. Then it fans out, creating one job per chunk of data.
require 'securerandom'

class BatchProcessor
  BATCH_SIZE = 1000

  def process_all(record_ids)
    batch_id = SecureRandom.uuid
    batches = record_ids.each_slice(BATCH_SIZE).to_a

    Sidekiq.redis do |r|
      # Queue the commands on the transaction object that multi yields;
      # calling them on the outer connection does not place them in the MULTI.
      r.multi do |tx|
        tx.hset("batch:#{batch_id}", "total", record_ids.size)
        tx.hset("batch:#{batch_id}", "processed", 0)
        tx.hset("batch:#{batch_id}", "batches_total", batches.size)
        tx.hset("batch:#{batch_id}", "batches_completed", 0)
      end
    end

    # Fan out: one job per chunk of IDs.
    batches.each_with_index do |batch, index|
      ProcessBatchJob.perform_async(batch_id, batch, index, batches.size)
    end

    batch_id
  end
end
Each ProcessBatchJob reports back its success and failure counts. The Redis counters are updated atomically. At the end, I can check these counters to see if the entire batch is done, or if some chunks need to be retried. This gives users a progress bar and makes the system resilient.
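ProcessBatchJob itself is not shown, so the following is only a sketch of the bookkeeping it might do. BatchProgress is a hypothetical in-memory stand-in for the Redis hash, and the failed counter is an addition for the example. Against Redis, each increment would be an HINCRBY call, which the server applies atomically; here a Mutex plays that role.

```ruby
# In-memory stand-in for the Redis hash the coordinator creates.
# BatchProgress and the :failed counter are assumptions for this sketch.
class BatchProgress
  def initialize(total:, batches_total:)
    @counters = { total: total, processed: 0, failed: 0,
                  batches_total: batches_total, batches_completed: 0 }
    @mutex = Mutex.new
  end

  # Called once by each chunk job when it finishes.
  def record_chunk(processed:, failed: 0)
    @mutex.synchronize do
      @counters[:processed] += processed
      @counters[:failed] += failed
      @counters[:batches_completed] += 1
    end
  end

  def done?
    @mutex.synchronize { @counters[:batches_completed] >= @counters[:batches_total] }
  end

  # Share of records handled so far (successes plus failures), as a percentage.
  def percent_complete
    @mutex.synchronize do
      return 100.0 if @counters[:total].zero?

      handled = @counters[:processed] + @counters[:failed]
      (handled * 100.0 / @counters[:total]).round(1)
    end
  end
end

progress = BatchProgress.new(total: 2500, batches_total: 3)
progress.record_chunk(processed: 1000)
progress.record_chunk(processed: 990, failed: 10)
puts progress.percent_complete # prints 80.0
puts progress.done?            # prints false
```

The percentage feeds a progress bar, and done? tells the coordinator when it is safe to inspect the failure count and decide what to retry.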
Not all jobs are triggered by user actions. Many need to run on a schedule: nightly reports, weekly cleanups, hourly cache refreshes. While you can use system cron, I prefer to keep the scheduling within the application using cron-like patterns. This keeps all the job logic in one place.
I create a scheduler class that knows about all the recurring tasks. It uses a cron parser to figure out the next run time and schedules the job accordingly.
class RecurringJobScheduler
  SCHEDULES = {
    daily_midnight: {
      class: 'DailyReportJob',
      cron: '0 0 * * *',
      args: []
    }
  }.freeze

  def schedule_job(name, config)
    cron_parser = CronParser.new(config[:cron])
    next_time = cron_parser.next(Time.current)
    job_class = config[:class].constantize
    job_class.set(wait_until: next_time).perform_async(*config[:args])
  end
end
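To show what computing the next run time involves, here is a deliberately tiny sketch that handles only once-a-day expressions such as '0 0 * * *'. The method name next_daily_run is hypothetical, and it is not a replacement for a real cron parser, which also handles ranges, steps, and the day and month fields.

```ruby
# Computes the next run for a "minute hour * * *" schedule (once a day, UTC).
# A tiny subset of cron written for illustration only.
def next_daily_run(cron, now)
  minute, hour, *rest = cron.split
  raise ArgumentError, "only daily schedules are supported" unless rest == %w[* * *]

  # Base 10 is explicit so values like "08" are not rejected as invalid octal.
  candidate = Time.utc(now.year, now.month, now.day, Integer(hour, 10), Integer(minute, 10))
  # If today's slot has already passed (or is right now), use tomorrow's.
  candidate += 24 * 60 * 60 if candidate <= now
  candidate
end

now = Time.utc(2024, 3, 10, 12, 30)
puts next_daily_run("0 0 * * *", now)   # midnight on 2024-03-11
puts next_daily_run("45 18 * * *", now) # 18:45 on 2024-03-10
```

The scheduler would pass a time like this to the job as its run-at timestamp, then compute the following one after each run.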
A related pattern is the self-rescheduling job. Imagine a job that needs to run every few minutes indefinitely, like checking a queue for new messages. Instead of relying on an external scheduler, the job can schedule its own next run before it finishes.
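class SelfReschedulingJob
  include Sidekiq::Job
  # Retries are disabled because the ensure block already schedules the next
  # run; with retries on, each retry would start an extra chain of runs.
  sidekiq_options retry: false

  def perform(interval = 300)
    collect_metrics
  ensure
    self.class.perform_in(interval, interval)
  end
end
The ensure block is crucial. It means the job will reschedule itself even if the main logic crashes. Because Sidekiq retries failed jobs by default, retries are turned off here; otherwise every retry would also reschedule, and the number of scheduled runs would grow with each failure. The result is a steady heartbeat of work.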
For really large data processing tasks, I sometimes borrow a concept from big data: map-reduce. The goal is to take a huge dataset, split it up for parallel processing (map), and then combine the results (reduce). This pattern is great for tasks that can be broken into independent pieces.
You start with a coordinator job that splits the data and launches many “map” jobs.
class MapReduceJob
  include Sidekiq::Job

  def perform(input_data_id, chunk_size = 100)
    input_data = fetch_input_data(input_data_id)
    chunks = split_into_chunks(input_data, chunk_size)

    map_tasks = chunks.map.with_index do |chunk, index|
      MapTaskJob.perform_async(input_data_id, chunk, index)
    end

    ReduceTaskJob.perform_in(60.seconds, input_data_id, map_tasks.count)
  end
end
Each MapTaskJob processes its chunk and stores the result in Redis with a unique key. The ReduceTaskJob is scheduled to run after a delay, giving the map jobs time to finish. That delay is only an estimate, so the reduce job should confirm that every expected result is present and reschedule itself if some are missing. Once the set is complete, it collects all the stored results, combines them, and saves a final output. This pattern lets you process vast amounts of data quickly by using many workers at once.
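The reduce side is not shown, so here is one way its completeness check could look. This is a sketch under stated assumptions: a plain Hash stands in for Redis, the key format "mapreduce:<id>:<index>" and the method name reduce_if_complete are hypothetical, and summing is just an example of a combine step.

```ruby
# results: a Hash standing in for Redis, keyed like "mapreduce:<id>:<index>".
# The key format and method name are assumptions made for this sketch.
def reduce_if_complete(results, input_data_id, expected_count)
  keys = (0...expected_count).map { |i| "mapreduce:#{input_data_id}:#{i}" }
  missing = keys.reject { |k| results.key?(k) }

  # Signal the caller to reschedule instead of reducing partial data.
  return { status: :waiting, missing: missing.size } unless missing.empty?

  { status: :done, total: keys.sum { |k| results[k] } }
end

store = { "mapreduce:7:0" => 10, "mapreduce:7:1" => 32 }
p reduce_if_complete(store, 7, 3) # status is :waiting, one result missing

store["mapreduce:7:2"] = 5
p reduce_if_complete(store, 7, 3) # status is :done, total is 47
```

A reduce job that receives :waiting would call perform_in on itself again rather than writing an incomplete result.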
In a distributed system, your jobs often depend on other services: a payment gateway, an email API, a file storage service. When those services have problems, you need to be a good citizen. If the payment gateway is slow and timing out, continuing to hammer it with retries can make everything worse. This is where the circuit breaker pattern shines.
A circuit breaker watches for failures. If failures pass a threshold, it “trips” and stops all further requests for a period of time. This gives the failing service time to recover.
class JobCircuitBreaker
  # Raised when the circuit is open and calls are being rejected.
  OpenCircuitError = Class.new(StandardError)

  def initialize(service_name, threshold: 5, timeout: 60)
    @service_name = service_name
    @failure_threshold = threshold
    @reset_timeout = timeout
  end

  def execute
    check_state
    begin
      result = yield
      record_success
      result
    rescue => e
      record_failure(e)
      raise
    end
  end

  private

  def check_state
    state_key = "circuit:#{@service_name}:state"
    state = Sidekiq.redis { |r| r.get(state_key) }
    return unless state == 'open'

    opened_at = Sidekiq.redis { |r| r.get("circuit:#{@service_name}:opened_at") }.to_i
    if Time.current.to_i - opened_at > @reset_timeout
      # Cooling-off period is over: let trial requests through.
      Sidekiq.redis do |r|
        r.set(state_key, 'half_open')
        r.set("circuit:#{@service_name}:failures", 0)
      end
    else
      raise OpenCircuitError, "#{@service_name} circuit is open"
    end
  end

  # One reasonable definition: a successful call closes the circuit and
  # clears the failure count.
  def record_success
    Sidekiq.redis do |r|
      r.set("circuit:#{@service_name}:state", 'closed')
      r.set("circuit:#{@service_name}:failures", 0)
    end
  end

  def record_failure(_error)
    failures_key = "circuit:#{@service_name}:failures"
    failures = Sidekiq.redis { |r| r.incr(failures_key) }.to_i
    return if failures < @failure_threshold

    Sidekiq.redis do |r|
      r.set("circuit:#{@service_name}:state", 'open')
      r.set("circuit:#{@service_name}:opened_at", Time.current.to_i)
    end
  end
end
You wrap your fragile external call inside the breaker’s execute block. If the service starts failing, the breaker will open and quickly fail subsequent jobs, preventing a backlog and system strain. After a cooling-off period, it moves to a half-open state and lets requests through again to test whether the service is healthy. A success closes the circuit, while continued failures open it again.
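The state transitions are easier to follow when they can be run without Redis or real waiting. The sketch below is an in-memory variant with an injectable clock. MemoryCircuitBreaker and OpenError are hypothetical names, it is not thread-safe, and unlike the Redis version it reopens immediately if the trial call in the half-open state fails.

```ruby
# In-memory circuit breaker with an injectable clock, for illustration only.
# MemoryCircuitBreaker and OpenError are hypothetical names.
class MemoryCircuitBreaker
  OpenError = Class.new(StandardError)

  attr_reader :state

  def initialize(threshold: 3, timeout: 60, clock: -> { Time.now.to_f })
    @threshold = threshold
    @timeout = timeout
    @clock = clock
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def execute
    if @state == :open
      raise OpenError, "circuit open" if @clock.call - @opened_at < @timeout

      @state = :half_open # cooling-off period is over; allow a trial call
    end

    result = yield
    @state = :closed
    @failures = 0
    result
  rescue OpenError
    raise # rejections are not counted as service failures
  rescue StandardError
    @failures += 1
    if @state == :half_open || @failures >= @threshold
      @state = :open
      @opened_at = @clock.call
    end
    raise
  end
end

now = 0.0
breaker = MemoryCircuitBreaker.new(threshold: 2, timeout: 60, clock: -> { now })
2.times do
  begin
    breaker.execute { raise IOError, "gateway timeout" }
  rescue IOError
    # expected: the wrapped call failed
  end
end
puts breaker.state             # prints open

now = 61.0                     # pretend the cooling-off period has passed
puts breaker.execute { "ok" }  # prints ok
puts breaker.state             # prints closed
```

Passing the clock in as a lambda keeps the example deterministic, which also makes a breaker like this straightforward to unit test.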
Finally, not all jobs are created equal. A job generating a weekly digest can wait. A job processing a user’s immediate purchase cannot. This is where priority queues come in. By separating jobs into different queues based on urgency, you can ensure critical work gets done first.
I set up multiple queues—like critical, high, default, and low—and assign different numbers of worker processes to each.
class PriorityQueueManager
  def distribute_work(job_class, arguments, priority: :default)
    queue = queue_for_priority(priority)
    job_class.set(queue: queue).perform_async(*arguments)
  end

  def queue_for_priority(priority)
    case priority
    when :critical then 'critical'
    when :high then 'high'
    when :low then 'low'
    else 'default'
    end
  end
end
You can even make priority dynamic. A job might start as default, but if it’s been retrying for an hour, you might bump it to high. I sometimes add logic to jobs that checks the age or importance of the data it’s handling and chooses a queue accordingly.
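As a rough sketch of that dynamic choice, the function below picks a queue from a job’s base queue and how long it has been waiting. The one-hour threshold comes from the example above; the QueueEscalation module, its escalation order, and the parameter names are assumptions for illustration.

```ruby
# Chooses a queue name from the base queue and the time spent waiting.
# Module name, escalation order, and parameters are hypothetical.
module QueueEscalation
  NEXT_QUEUE = { "low" => "default", "default" => "high", "high" => "critical" }.freeze

  module_function

  def queue_for(base_queue, first_enqueued_at, now: Time.now, max_wait: 3600)
    waited = now - first_enqueued_at # seconds, as a Float
    return base_queue if waited < max_wait

    # Move up one level; queues without a higher level stay where they are.
    NEXT_QUEUE.fetch(base_queue, base_queue)
  end
end

start = Time.utc(2024, 1, 1, 9, 0)
puts QueueEscalation.queue_for("default", start, now: start + 600)   # prints default
puts QueueEscalation.queue_for("default", start, now: start + 4000)  # prints high
puts QueueEscalation.queue_for("critical", start, now: start + 4000) # prints critical
```

The returned name can be passed to the queue option when the job is re-enqueued, so a job that has waited too long moves up one level on its next attempt.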
These patterns aren’t just theoretical. They’re the building blocks I use to create background processing systems that are robust, observable, and efficient. They handle failure gracefully, provide visibility into progress, and ensure that the most important work gets the resources it needs. Start with idempotency and reliable scheduling, then layer in batching and prioritization as your scale demands. The goal is to make your background workers as dependable as the sunrise.