Build Bulletproof Ruby Background Jobs: Patterns for Handling Real Production Failures

Learn 7 essential Ruby background job patterns for idempotent processing, job chaining, batch handling & error recovery. Build reliable production systems.

When you’re building Ruby applications that real people depend on, the work doesn’t always finish when a web request ends. Sending emails, processing videos, analyzing data—these tasks can take seconds, minutes, or even hours. If you make a user wait for that, they’ll leave. That’s where background jobs come in. They’re the silent engines that handle the heavy lifting out of sight.

But moving work to the background is just the first step. In a live production environment, things get messy. Jobs fail, servers restart, databases get slow, and third-party services go offline. Over the years, I’ve learned that basic job queuing isn’t enough. You need robust patterns to handle these realities. Let’s walk through some of the most effective strategies I use to keep background processing reliable.

First, let’s talk about making jobs safe to run more than once. This is called idempotency. Imagine you have a job that charges a customer’s credit card. What happens if that job gets queued twice by mistake, or retries after a timeout? You don’t want to charge someone twice. An idempotent job ensures that even if you run it ten times, the final outcome is the same as running it once.

Here’s how I often structure this. The key is to track state and use locks to prevent two workers from acting on the same resource at the same time.

require "securerandom"

class ProcessOrderJob
  include Sidekiq::Job
  sidekiq_options retry: 5

  LOCK_TTL = 300 # seconds; should exceed the longest expected run

  def perform(order_id)
    redis_key = "job:process_order:#{order_id}:lock"
    token = SecureRandom.hex(8) # identifies this run as the lock owner

    # SET with NX and EX claims the lock and sets its expiry in one atomic
    # step, so two workers cannot both pass a separate "check, then set".
    acquired = Sidekiq.redis { |r| r.set(redis_key, token, nx: true, ex: LOCK_TTL) }
    unless acquired
      logger.info "Order #{order_id} is already being processed."
      return
    end

    begin
      order = Order.find(order_id)
      process_order_safely(order)
    ensure
      # Release only our own lock. GET followed by DEL is not atomic; the
      # TTL bounds the damage if the lock expired and changed hands.
      Sidekiq.redis do |r|
        r.del(redis_key) if r.get(redis_key) == token
      end
    end
  end

  private

  def process_order_safely(order)
    # with_lock opens a transaction and takes a row-level lock on the order.
    order.with_lock do
      # `next` leaves the block without a `return` inside the transaction.
      next if order.processed_at.present?

      charge_payment(order)
      update_inventory(order)
      send_notifications(order)

      order.update!(processed_at: Time.current)
    end
  end
end

This pattern uses Redis to set a short-lived lock. The first job to arrive claims the lock. Any other job that comes along sees the lock and quietly exits. Inside, the database transaction and lock ensure the core logic only executes if the order hasn’t already been processed. It’s a belt-and-suspenders approach that has saved me from many duplicate charges.
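To see both guards working without Redis or a database, here’s a minimal in-memory sketch. FakeLockStore, FakeOrder, and process_once are hypothetical names invented for this illustration; the store only imitates the "set if absent" behavior a real Redis lock would provide.

```ruby
require "securerandom"

# In-memory stand-in for a Redis "set if absent" lock: the first writer wins.
# Hypothetical class for illustration only; not part of Sidekiq.
class FakeLockStore
  def initialize
    @data = {}
    @mutex = Mutex.new
  end

  # Returns true if the key was claimed, false if someone already holds it.
  def set_if_absent(key, value)
    @mutex.synchronize do
      return false if @data.key?(key)

      @data[key] = value
      true
    end
  end

  # Deletes the key only when the caller still owns it.
  def release(key, value)
    @mutex.synchronize { @data.delete(key) if @data[key] == value }
  end
end

# Plain struct standing in for a database-backed order.
FakeOrder = Struct.new(:id, :processed_at, :charges)

def process_once(store, order)
  key = "lock:order:#{order.id}"
  token = SecureRandom.hex(8)
  # First guard: only one caller at a time may work on this order.
  return :skipped_locked unless store.set_if_absent(key, token)

  begin
    # Second guard: persisted state says the work is already done.
    return :skipped_done if order.processed_at

    order.charges += 1
    order.processed_at = Time.now
    :processed
  ensure
    store.release(key, token)
  end
end

store = FakeLockStore.new
order = FakeOrder.new(42, nil, 0)
results = 3.times.map { process_once(store, order) }
```

Running the same work three times charges the order once. The lock handles concurrent duplicates, while the processed_at check handles duplicates that arrive later.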

Sometimes, work isn’t a single task but a sequence. You must validate a user, then charge them, then ship their order, then send a confirmation. These steps depend on each other. Running them in the wrong order, or before a prerequisite is finished, causes chaos. For this, I use job chaining.

The idea is to make jobs aware of their dependencies. A job shouldn’t start its real work until the job before it has finished successfully.

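class PaymentJob
  include Sidekiq::Job

  # Sidekiq serializes arguments as JSON, so hash keys arrive as strings.
  def perform(order_id, options = {})
    depends_on = options["depends_on"]
    if depends_on && !dependency_completed?(depends_on)
      # Prerequisite not finished yet: check again in a few seconds.
      self.class.perform_in(5.seconds, order_id, options)
      return
    end
    process_payment(order_id)
  end

  private

  # Assumes the upstream job writes this marker key when it succeeds;
  # the key name is illustrative.
  def dependency_completed?(jid)
    !Sidekiq.redis { |r| r.get("job:completed:#{jid}") }.nil?
  end
end

In this snippet, the PaymentJob receives the ID of the job it depends on and checks whether that job has finished. Sidekiq doesn’t keep a record of which job IDs have completed, so the check reads a marker key that the upstream job is assumed to write when it succeeds. If the marker isn’t there yet, the payment job reschedules itself for a few seconds later. This polling is simple but effective, though you’ll want a cap on attempts so a failed prerequisite doesn’t leave the job polling forever. For more complex workflows, I’ll store the state of the entire pipeline in Redis to monitor progress and handle failures at each stage.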
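Tracking a whole pipeline could look roughly like the sketch below. It keeps the step statuses in a plain Ruby hash so it runs on its own; in production that hash would live in Redis under a per-pipeline key. PipelineState and its step names are hypothetical, chosen to match the validate, charge, ship, confirm sequence described earlier.

```ruby
# Hypothetical in-memory pipeline tracker, for illustration only.
class PipelineState
  STEPS = %w[validate charge ship confirm].freeze

  def initialize
    @status = STEPS.to_h { |step| [step, :pending] }
  end

  # Records :completed or :failed for a step.
  def mark(step, status)
    raise ArgumentError, "unknown step: #{step}" unless @status.key?(step)

    @status[step] = status
  end

  # A step may run only when every earlier step has completed.
  def ready?(step)
    index = STEPS.index(step)
    raise ArgumentError, "unknown step: #{step}" if index.nil?

    STEPS.first(index).all? { |earlier| @status[earlier] == :completed }
  end

  # First step that failed, or nil when nothing has failed.
  def failed_step
    STEPS.find { |step| @status[step] == :failed }
  end
end

pipeline = PipelineState.new
before = pipeline.ready?("charge")      # "validate" is still pending
pipeline.mark("validate", :completed)
after = pipeline.ready?("charge")       # prerequisite is now complete
pipeline.mark("charge", :failed)
```

Each job would call ready? before doing its work and mark when it finishes, which gives a single place to see where a pipeline stalled.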

Now, what about processing 100,000 records? You can’t throw that into one giant job. It will be slow, and if it fails, you lose all progress. The solution is batch processing. Break the big list into small chunks and process each chunk as its own job. The trick is tracking the overall progress so you know how things are going.

I create a batch coordinator that sets up a tracking record in Redis. Then it fans out, creating one job per chunk of data.

require "securerandom"

class BatchProcessor
  BATCH_SIZE = 1000

  def process_all(record_ids)
    batch_id = SecureRandom.uuid
    batches = record_ids.each_slice(BATCH_SIZE).to_a
    key = "batch:#{batch_id}"

    # Write all counters in one MULTI/EXEC so a reader never sees a
    # half-initialized tracking record. Commands go on the yielded
    # transaction object, not on the outer connection.
    Sidekiq.redis do |r|
      r.multi do |tx|
        tx.hset(key, "total", record_ids.size)
        tx.hset(key, "processed", 0)
        tx.hset(key, "batches_total", batches.size)
        tx.hset(key, "batches_completed", 0)
      end
    end

    # Fan out: one job per chunk of IDs.
    batches.each_with_index do |batch, index|
      ProcessBatchJob.perform_async(batch_id, batch, index, batches.size)
    end
    batch_id
  end
end

Each ProcessBatchJob reports back its success and failure counts. The Redis counters are updated atomically. At the end, I can check these counters to see if the entire batch is done, or if some chunks need to be retried. This gives users a progress bar and makes the system resilient.
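The reporting side can be sketched without Redis. In the example below, BatchProgress is a hypothetical in-memory stand-in for the tracking hash: a Mutex plays the role that atomic Redis increments would play in production, and the "failed" counter is an assumption added to hold the failure counts mentioned above.

```ruby
# Hypothetical in-memory stand-in for the Redis hash at "batch:<id>".
class BatchProgress
  def initialize(total:, batches_total:)
    @mutex = Mutex.new
    @counters = {
      "total" => total, "processed" => 0, "failed" => 0,
      "batches_total" => batches_total, "batches_completed" => 0
    }
  end

  # Called once by each chunk job when it finishes.
  def report_chunk(succeeded:, failed:)
    @mutex.synchronize do
      @counters["processed"] += succeeded
      @counters["failed"] += failed
      @counters["batches_completed"] += 1
    end
  end

  def done?
    @mutex.synchronize { @counters["batches_completed"] >= @counters["batches_total"] }
  end

  def failed_count
    @mutex.synchronize { @counters["failed"] }
  end

  # Share of records handled so far, suitable for a progress bar.
  def percent_complete
    @mutex.synchronize do
      total = @counters["total"]
      return 100.0 if total.zero?

      handled = @counters["processed"] + @counters["failed"]
      (handled * 100.0 / total).round(1)
    end
  end
end

record_ids = (1..2500).to_a
chunks = record_ids.each_slice(1000).to_a
progress = BatchProgress.new(total: record_ids.size, batches_total: chunks.size)

# Threads stand in for separate worker processes.
threads = chunks.map do |chunk|
  Thread.new do
    failed = chunk.count { |id| (id % 500).zero? } # pretend these records fail
    progress.report_chunk(succeeded: chunk.size - failed, failed: failed)
  end
end
threads.each(&:join)
```

Once done? returns true, a non-zero failed_count is the signal that some chunks deserve a retry or a closer look.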

Not all jobs are triggered by user actions. Many need to run on a schedule: nightly reports, weekly cleanups, hourly cache refreshes. While you can use system cron, I prefer to keep the scheduling within the application using cron-like patterns. This keeps all the job logic in one place.

I create a scheduler class that knows about all the recurring tasks. It uses a cron parser to figure out the next run time and schedules the job accordingly.

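class RecurringJobScheduler
  SCHEDULES = {
    daily_midnight: {
      class: 'DailyReportJob',
      cron: '0 0 * * *',
      args: []
    }
  }.freeze

  def schedule_job(name, config)
    # CronParser is assumed to come from a cron-parsing gem such as parse-cron.
    cron_parser = CronParser.new(config[:cron])
    next_time = cron_parser.next(Time.current)

    job_class = config[:class].constantize
    # Sidekiq schedules a job for a specific time with perform_at;
    # set(wait_until: ...) belongs to Active Job, not Sidekiq::Job.
    job_class.perform_at(next_time, *config[:args])
  end
end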

A related pattern is the self-rescheduling job. Imagine a job that needs to run every few minutes indefinitely, like checking a queue for new messages. Instead of relying on an external scheduler, the job can schedule its own next run before it finishes.

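class SelfReschedulingJob
  include Sidekiq::Job
  # Disable retries: the ensure block already queues the next run, so a
  # retried failure would start a second, parallel chain of runs.
  sidekiq_options retry: false

  def perform(interval = 300)
    collect_metrics
  ensure
    self.class.perform_in(interval, interval)
  end
end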

The ensure block is crucial. It means the job will reschedule itself even if the main logic crashes. This creates a steady heartbeat of work.

For really large data processing tasks, I sometimes borrow a concept from big data: map-reduce. The goal is to take a huge dataset, split it up for parallel processing (map), and then combine the results (reduce). This pattern is great for tasks that can be broken into independent pieces.

You start with a coordinator job that splits the data and launches many “map” jobs.

class MapReduceJob
  include Sidekiq::Job

  def perform(input_data_id, chunk_size = 100)
    input_data = fetch_input_data(input_data_id)
    chunks = split_into_chunks(input_data, chunk_size)

    map_tasks = chunks.map.with_index do |chunk, index|
      MapTaskJob.perform_async(input_data_id, chunk, index)
    end

    ReduceTaskJob.perform_in(60.seconds, input_data_id, map_tasks.count)
  end
end

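Each MapTaskJob processes its chunk and stores the result in Redis with a unique key. The ReduceTaskJob is scheduled to run after a delay, giving the map jobs time to finish. A fixed delay is only a guess, though, so the reduce job should compare the number of stored results with the expected count (passed in as its second argument) and reschedule itself if any are missing. When all results are present, it collects them, combines them, and saves a final output. This pattern lets you process vast amounts of data quickly by using many workers at once.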
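Here’s a small, self-contained sketch of that flow, using threads in place of workers and a plain object in place of Redis. ResultStore, map_chunk, and reduce_all are hypothetical names, and the word count is just a stand-in for real per-chunk work. The reduce step refuses to combine anything until every expected map result has arrived.

```ruby
# In-memory stand-in for the per-chunk result keys the map jobs would
# write to Redis. Hypothetical class for illustration only.
class ResultStore
  def initialize
    @mutex = Mutex.new
    @results = {}
  end

  def write(index, value)
    @mutex.synchronize { @results[index] = value }
  end

  def values
    @mutex.synchronize { @results.values }
  end

  def size
    @mutex.synchronize { @results.size }
  end
end

# Map step: each chunk is summarized independently (here, a word count).
def map_chunk(store, chunk, index)
  counts = chunk.each_with_object(Hash.new(0)) { |word, acc| acc[word] += 1 }
  store.write(index, counts)
end

# Reduce step: combine only when every map result is present.
def reduce_all(store, expected_chunks)
  return :not_ready if store.size < expected_chunks

  store.values.each_with_object(Hash.new(0)) do |counts, total|
    counts.each { |word, n| total[word] += n }
  end
end

words = %w[ruby redis ruby job queue job ruby]
chunks = words.each_slice(3).to_a
store = ResultStore.new

early = reduce_all(store, chunks.size) # no map results yet

workers = chunks.each_with_index.map do |chunk, index|
  Thread.new { map_chunk(store, chunk, index) }
end
workers.each(&:join)

totals = reduce_all(store, chunks.size)
```

In a Sidekiq version, the :not_ready branch is where the reduce job would reschedule itself instead of returning.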

In a distributed system, your jobs often depend on other services: a payment gateway, an email API, a file storage service. When those services have problems, you need to be a good citizen. If the payment gateway is slow and timing out, continuing to hammer it with retries can make everything worse. This is where the circuit breaker pattern shines.

A circuit breaker watches for failures. If failures pass a threshold, it “trips” and stops all further requests for a period of time. This gives the failing service time to recover.

class JobCircuitBreaker
  # Raised when the circuit is open and the call is skipped.
  OpenCircuitError = Class.new(StandardError)

  def initialize(service_name, threshold: 5, timeout: 60)
    @service_name = service_name
    @failure_threshold = threshold
    @reset_timeout = timeout
  end

  def execute
    check_state
    begin
      result = yield
      record_success
      result
    rescue => e
      record_failure(e)
      raise
    end
  end

  private

  def key(suffix)
    "circuit:#{@service_name}:#{suffix}"
  end

  def check_state
    state = Sidekiq.redis { |r| r.get(key("state")) }
    return unless state == 'open'

    opened_at = Sidekiq.redis { |r| r.get(key("opened_at")) }.to_i
    if Time.current.to_i - opened_at > @reset_timeout
      # Cooling-off period is over: allow a trial request through.
      Sidekiq.redis { |r| r.set(key("state"), 'half_open') }
    else
      raise OpenCircuitError, "#{@service_name} circuit is open"
    end
  end

  # A successful call closes the circuit and clears the failure count.
  def record_success
    Sidekiq.redis do |r|
      r.set(key("state"), 'closed')
      r.set(key("failures"), 0)
    end
  end

  def record_failure(_error)
    failures = Sidekiq.redis { |r| r.incr(key("failures")) }.to_i
    state = Sidekiq.redis { |r| r.get(key("state")) }

    # Open at the threshold, or immediately if a half-open trial failed.
    if failures >= @failure_threshold || state == 'half_open'
      Sidekiq.redis do |r|
        r.set(key("state"), 'open')
        r.set(key("opened_at"), Time.current.to_i)
      end
    end
  end
end

You wrap your fragile external call inside the breaker’s execute block. If the service starts failing, the breaker will open and quickly fail subsequent jobs, preventing a backlog and system strain. After a cooling-off period, it lets one request through to test if the service is healthy again.
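The state transitions are easier to follow in a version with no Redis at all. The sketch below is a hypothetical in-memory breaker (SimpleBreaker is an invented name) with an injectable clock so the cooling-off period can be simulated. It is meant to illustrate the closed, open, and half-open states, not to replace the Redis-backed class above, which shares state across workers.

```ruby
# Hypothetical in-memory breaker for illustration; state lives in instance
# variables instead of Redis, and the clock is injectable for testing.
class SimpleBreaker
  OpenError = Class.new(StandardError)

  attr_reader :state

  def initialize(threshold: 3, timeout: 60, clock: -> { Time.now.to_f })
    @threshold = threshold
    @timeout = timeout
    @clock = clock
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def execute
    if @state == :open
      raise OpenError, "circuit open" if @clock.call - @opened_at < @timeout

      @state = :half_open # cooling-off elapsed: allow a trial call
    end

    begin
      result = yield
    rescue StandardError
      @failures += 1
      trip if @state == :half_open || @failures >= @threshold
      raise
    end

    # Success closes the circuit and clears the failure count.
    @state = :closed
    @failures = 0
    result
  end

  private

  def trip
    @state = :open
    @opened_at = @clock.call
  end
end

now = 0.0
breaker = SimpleBreaker.new(threshold: 2, timeout: 30, clock: -> { now })

2.times do
  begin
    breaker.execute { raise IOError, "gateway timeout" }
  rescue IOError
    # expected: the failure is recorded and re-raised
  end
end
state_after_failures = breaker.state

# While open, calls fail fast without touching the service.
fast_fail = begin
  breaker.execute { :never_runs }
rescue SimpleBreaker::OpenError
  :rejected
end

now += 31 # simulate the cooling-off period passing
recovered = breaker.execute { :ok }
```

In a job, rescuing the open-circuit error and rescheduling with a delay is usually kinder to the failing service than letting the job burn through its retries.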

Finally, not all jobs are created equal. A job generating a weekly digest can wait. A job processing a user’s immediate purchase cannot. This is where priority queues come in. By separating jobs into different queues based on urgency, you can ensure critical work gets done first.

I set up multiple queues—like critical, high, default, and low—and assign different numbers of worker processes to each.

class PriorityQueueManager
  def distribute_work(job_class, arguments, priority: :default)
    queue = queue_for_priority(priority)
    job_class.set(queue: queue).perform_async(*arguments)
  end

  def queue_for_priority(priority)
    case priority
    when :critical then 'critical'
    when :high then 'high'
    when :low then 'low'
    else 'default'
    end
  end
end

You can even make priority dynamic. A job might start as default, but if it’s been retrying for an hour, you might bump it to high. I sometimes add logic to jobs that checks the age or importance of the data it’s handling and chooses a queue accordingly.
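As a rough sketch of that idea, the helper below picks a queue from a base priority and how long the work has been waiting. The name queue_for and both thresholds are assumptions made for illustration; the queue names match the four queues described above.

```ruby
# Queues ordered from least to most urgent.
QUEUES = %w[low default high critical].freeze

# Hypothetical helper: bumps the priority as the work gets older.
# The one-hour and four-hour thresholds are illustrative only.
def queue_for(base_priority, enqueued_at, now: Time.now)
  level = QUEUES.index(base_priority.to_s) || QUEUES.index("default")
  waited = now - enqueued_at # seconds

  level += 1 if waited >= 3600      # waiting an hour: bump one level
  level += 1 if waited >= 4 * 3600  # waiting four hours: bump again

  QUEUES[[level, QUEUES.size - 1].min]
end

now = Time.at(1_700_000_000)
fresh   = queue_for(:default, now - 60, now: now)
aged    = queue_for(:default, now - 3700, now: now)
ancient = queue_for(:high, now - 5 * 3600, now: now)
```

The result can be passed straight to job_class.set(queue: ...) when a job is enqueued or re-enqueued.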

These patterns aren’t just theoretical. They’re the building blocks I use to create background processing systems that are robust, observable, and efficient. They handle failure gracefully, provide visibility into progress, and ensure that the most important work gets the resources it needs. Start with idempotency and reliable scheduling, then layer in batching and prioritization as your scale demands. The goal is to make your background workers as dependable as the sunrise.

