Production Debugging Guide: Safe Methods to Fix Live Applications Without Breaking Anything
Master production debugging with safe logging, exception tracking, debug endpoints, sandboxed evaluation, profiling & distributed tracing. Turn application mysteries into systematic discovery processes.
I want to talk about a crucial skill: figuring out why your application is acting up when it’s live and real users are depending on it. Production debugging can feel like trying to fix a car engine while it’s still running down the highway. You need methods that are safe, precise, and don’t cause a bigger problem.
The goal is to see what’s happening inside the application without breaking anything. Over time, I’ve found several approaches that work well together. Let me walk you through them.
First, let’s talk about logs. When an error occurs, a simple log message like “Something went wrong” is not very helpful. You need the full story. I make sure every log entry has context. This means attaching a unique identifier to every single request when it first comes in.
Then, every piece of code that logs anything during that request’s life uses that same identifier. It’s like giving every customer in a store a unique receipt number. If they have a problem, you can look up their entire transaction history. You log when the request starts, what parameters it had, who the user was, and when it finishes.
You also have to be careful. Logs can contain sensitive information like passwords or credit card numbers. I always filter those out before they are written. Here is a basic way to structure that logging logic in a Rails application.
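require "securerandom"

# RequestLoggingMiddleware is an illustrative name for a Rack middleware
class RequestLoggingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    request = ActionDispatch::Request.new(env)
    # Reuse the ID Rails has already assigned, if any, so every log line agrees
    request_id = request.request_id || SecureRandom.uuid
    Rails.logger.tagged(request_id) do
      # filtered_parameters applies config.filter_parameters (passwords and so on)
      Rails.logger.info("Starting request to #{request.path} params=#{request.filtered_parameters}")
      begin
        status, headers, response = @app.call(env)
        Rails.logger.info("Request completed with status #{status}")
        [status, headers, response]
      rescue => error
        Rails.logger.error("Request failed: #{error.class}: #{error.message}")
        raise
      end
    end
  end
end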
This pattern turns your logs from a pile of confusing messages into a series of clear stories. You can follow one request from beginning to end, even if your server is handling hundreds at the same time.
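For the filtering step, Rails offers config.filter_parameters. To show the idea independently of Rails, here is a minimal sketch in plain Ruby. The SENSITIVE_KEYS pattern and the filter_sensitive helper are names I made up for illustration, and the key list is an assumption you would adapt to your own data.

```ruby
# Keys whose values should never reach the logs (an illustrative list)
SENSITIVE_KEYS = /password|token|secret|card_number|cvv/i

# Returns a copy of the value with sensitive entries masked.
# Walks nested hashes and arrays; the input is not modified.
def filter_sensitive(value, key = nil)
  return "[FILTERED]" if key && key.to_s.match?(SENSITIVE_KEYS)

  case value
  when Hash  then value.to_h { |k, v| [k, filter_sensitive(v, k)] }
  when Array then value.map { |item| filter_sensitive(item) }
  else value
  end
end

params = {
  "user" => { "email" => "a@example.com", "password" => "hunter2" },
  "cards" => [{ "card_number" => "4111111111111111" }]
}
p filter_sensitive(params)
```

Matching on key names is simple and fast, but it only protects fields you have thought to list, so it is worth reviewing the pattern whenever a new form or API field is added.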
Logs are great for tracing events, but when an error crashes the whole request, you need more. You need a snapshot of exactly what the application looked like at that moment. This is where structured exception tracking comes in.
When I catch an exception, I don’t just record the error message. I gather everything that might be relevant. What user was logged in? What were the parameters of the request? What other data was in memory? I save all of this together with the error.
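# ExceptionRecord is an application-defined model; current_user and request
# come from the controller this method lives in
def capture_exception(error)
  context = {
    user_id: current_user&.id,
    request_params: request.filtered_parameters,
    time: Time.current
  }
  ExceptionRecord.create!(
    error_class: error.class.name,
    message: error.message,
    backtrace: error.backtrace,
    context: context
  )
end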
This context is the difference between knowing “there was an error” and knowing “User John Doe got an error when trying to save his profile with these specific fields.” The second one is something you can actually fix.
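To make the shape of such a record concrete without a database, here is a small plain-Ruby sketch. The build_exception_record helper and its field names are my own assumptions, not part of any library. It trims the backtrace and produces something you could serialize to JSON and ship anywhere.

```ruby
require "json"
require "time"

# Builds a structured, serializable record from an exception plus context.
def build_exception_record(error, context = {})
  {
    error_class: error.class.name,
    message: error.message,
    # Keep only the top frames; full backtraces can be very long
    backtrace: (error.backtrace || []).first(20),
    context: context,
    occurred_at: Time.now.utc.iso8601
  }
end

begin
  Integer("not a number")
rescue ArgumentError => e
  record = build_exception_record(e, { user_id: 42, action: "profile#update" })
  puts JSON.pretty_generate(record)
end
```

Storing the error class separately from the message makes it much easier to group and count similar failures later.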
Sometimes, you need to ask the application a direct question while it’s running. Is the database connected? How much memory is being used? Are background jobs piling up? For this, I create special, secure debug endpoints.
These are like health check pages, but with more detail. They are locked down with secret tokens or allowed IP addresses so only engineers can access them. They never change data; they only read and report. A simple one might check the database.
# In a secure debug controller
def database_status
  status =
    begin
      # Read-only probe: a trivial query proves the connection works
      ActiveRecord::Base.connection.execute("SELECT 1")
      { connected: true }
    rescue => e
      { connected: false, error: e.message }
    end
  render json: status
end
I can call this endpoint from my browser or a command line tool. It tells me instantly if the database is reachable. I can make similar endpoints for cache status, queue lengths, or memory usage. It’s a quick way to rule out major system problems.
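The locking-down part deserves a sketch of its own. The following plain-Ruby example is one possible access check. The network list, the function names, and the choice to require both an allowed IP and a valid token are my assumptions. It compares tokens in a way that avoids leaking information through timing.

```ruby
require "digest"
require "ipaddr"

# Networks allowed to reach debug endpoints (illustrative values)
ALLOWED_NETWORKS = [IPAddr.new("10.0.0.0/8"), IPAddr.new("127.0.0.1/32")].freeze

# Compares two strings without stopping at the first differing byte.
# Hashing first gives both sides the same length.
def secure_compare(a, b)
  diff = 0
  Digest::SHA256.digest(a).bytes.zip(Digest::SHA256.digest(b).bytes) do |x, y|
    diff |= x ^ y
  end
  diff.zero?
end

# True only when the caller's IP is in an allowed network AND the token matches.
def debug_access_allowed?(remote_ip, presented_token, expected_token)
  return false if presented_token.nil? || expected_token.nil? || expected_token.empty?

  ip = IPAddr.new(remote_ip)
  ALLOWED_NETWORKS.any? { |network| network.include?(ip) } &&
    secure_compare(presented_token, expected_token)
rescue IPAddr::Error
  false # an unparseable address is treated as not allowed
end

puts debug_access_allowed?("10.1.2.3", "s3cret", "s3cret")
puts debug_access_allowed?("8.8.8.8", "s3cret", "s3cret")
```

In a Rails controller, a check like this would typically run in a before_action that returns a 404 or 403 when it fails.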
There are moments when looking at logs and metrics isn’t enough. You need to run a small piece of code to test a hypothesis. Maybe you want to see what a specific calculation returns for a specific user. Doing this directly on a production server is dangerous. You could accidentally delete data.
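So, I use a controlled, restricted console instead. It refuses obviously dangerous commands like deleting files or executing shell commands, and it lets me evaluate simple Ruby code in the context of the live app. A word filter like this is a guard against accidents, not a true sandbox: determined input can get around it, so the console itself has to sit behind the same strict access controls as the debug endpoints.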
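def safe_evaluate(code_string)
  # First, block obviously dangerous code. This catches mistakes only;
  # it is not a security boundary.
  forbidden_words = ['`', '%x', 'system', 'exec', 'spawn', 'File.delete']
  return "Blocked" if forbidden_words.any? { |word| code_string.include?(word) }

  # Run the code inside a transaction that is always rolled back,
  # so accidental database writes do not persist
  result = nil
  ActiveRecord::Base.transaction do
    result = eval(code_string).inspect
    raise ActiveRecord::Rollback
  end
  result
rescue StandardError, ScriptError => e
  # ScriptError covers SyntaxError, which a bare rescue would miss
  "Error: #{e.message}"
end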
This is a powerful tool. I can ask, “What does User.find(123).account_balance return right now?” and get an immediate answer. It bridges the gap between guessing what’s wrong and knowing what’s wrong.
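When the questions you ask are predictable, a stricter alternative to filtering arbitrary code is to allow only a fixed menu of named, read-only checks. This is a different technique from the eval-based console, shown here as a plain-Ruby sketch. The DIAGNOSTICS table and run_diagnostic are hypothetical names, and the three checks are placeholders for whatever your app needs.

```ruby
# A fixed menu of read-only checks; nothing outside this table can run.
DIAGNOSTICS = {
  "ruby_version" => -> { RUBY_VERSION },
  "gc_count"     => -> { GC.count },
  "thread_count" => -> { Thread.list.size }
}.freeze

# Looks up a check by name and returns its inspected result.
def run_diagnostic(name)
  check = DIAGNOSTICS.fetch(name) { return "Unknown diagnostic: #{name}" }
  check.call.inspect
rescue StandardError => e
  "Error: #{e.message}"
end

puts run_diagnostic("ruby_version")
puts run_diagnostic("rm -rf /")
```

The trade-off is flexibility: you can only ask questions someone has already written down, but no input string can ever turn into executable code.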
Performance issues are a special kind of bug. The application works, but it’s slow. To find out why, I add profiling to specific requests. I don’t profile everything all the time, because that would slow the whole app down. Instead, I turn it on for just one request, often by adding a special header that only authorized engineers can send.
When profiling is on, I measure how long each part of the request takes. How much time was spent in the database? How many queries did we run? How long did the external API call take? I capture all this data and save it with the request’s ID.
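# A simplified profiler; wrap the code you want to measure in the block
def profile_request(request_id)
  # A monotonic clock is not affected by system clock adjustments
  start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  sql_queries = []
  # Hook into database notifications (the subscription sees queries from every thread)
  subscriber = ActiveSupport::Notifications.subscribe("sql.active_record") do |*args|
    event = ActiveSupport::Notifications::Event.new(*args)
    sql_queries << { sql: event.payload[:sql], duration: event.duration } # milliseconds
  end
  result = yield # ... run the request ...
  total_time = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time # seconds
  profile_data = { request_id: request_id, total_time: total_time, sql_queries: sql_queries }
  save_profile(profile_data) # save_profile is an application-defined helper
  result
ensure
  # Without this, the subscriber leaks and keeps recording later requests
  ActiveSupport::Notifications.unsubscribe(subscriber) if subscriber
end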
Later, I can analyze this profile. I might find one query that’s taking 90% of the time, or I might see the same simple query run hundreds of times in a loop. This tells me exactly where to focus my optimization efforts.
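As a rough illustration of that analysis, here is a plain-Ruby sketch that groups captured queries by shape, so repeated queries and dominant queries both stand out. It assumes the sql_queries format used above (a :sql string and a :duration in milliseconds). The summarize_queries name and the simple regex normalization are my own simplifications.

```ruby
# Groups queries that differ only in literal values and ranks them by time.
def summarize_queries(sql_queries)
  total = sql_queries.sum { |q| q[:duration].to_f }

  sql_queries
    .group_by { |q| q[:sql].gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?") }
    .map do |shape, group|
      time = group.sum { |q| q[:duration].to_f }
      {
        sql: shape,
        count: group.size,
        total_ms: time.round(2),
        share: total.zero? ? 0.0 : (time / total * 100).round(1) # percent of SQL time
      }
    end
    .sort_by { |row| -row[:total_ms] }
end

queries = [
  { sql: "SELECT * FROM posts WHERE user_id = 1", duration: 2.0 },
  { sql: "SELECT * FROM posts WHERE user_id = 2", duration: 2.0 },
  { sql: "SELECT * FROM posts WHERE user_id = 3", duration: 2.0 },
  { sql: "SELECT * FROM reports", duration: 94.0 }
]
summarize_queries(queries).each { |row| puts row }
```

A high count on one query shape usually points to an N+1 pattern, while a high share on a single query points to a missing index or an expensive scan.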
Some bugs are sneaky. They cause the application to gradually use more and more memory until it crashes. To catch these, I take snapshots of the application’s state at different times and compare them.
Think of it like taking two pictures of a room, an hour apart, and spotting what’s different. I record things like how many objects are in memory, how much memory the process is using, and how many database connections are open.
def take_snapshot(label)
  counts = ObjectSpace.count_objects
  {
    label: label,
    # ps reports resident set size in kilobytes, not bytes
    memory_usage: `ps -o rss= -p #{Process.pid}`.to_i,
    # TOTAL counts every heap slot, so subtract the free ones to get live objects
    object_count: counts[:TOTAL] - counts[:FREE]
  }
end
# Usage
snapshot1 = take_snapshot("Before operation")
do_something_big
snapshot2 = take_snapshot("After operation")
puts "Memory increased by #{snapshot2[:memory_usage] - snapshot1[:memory_usage]} KB"
If a particular task causes a huge spike in memory, this comparison will reveal it. I can then dig deeper to find out which specific objects are being created and not cleaned up.
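One way to start that deeper dig is to compare object counts by type rather than in total. The sketch below uses only Ruby's built-in ObjectSpace. The object_counts and diff_counts helpers are hypothetical names, and the "leak" is simulated by holding on to an array of strings.

```ruby
# Counts live objects per internal type (T_STRING, T_ARRAY, ...).
def object_counts
  GC.start # collect garbage first so the counts reflect retained objects
  ObjectSpace.count_objects.reject { |type, _| %i[TOTAL FREE].include?(type) }
end

# Returns only the types whose counts changed, largest growth first.
def diff_counts(before, after)
  changes = after.each_with_object({}) do |(type, count), diff|
    delta = count - before.fetch(type, 0)
    diff[type] = delta unless delta.zero?
  end
  changes.sort_by { |_, delta| -delta }.to_h
end

before = object_counts
retained = Array.new(50_000) { |i| "leaked string #{i}" } # simulated leak
after = object_counts

diff_counts(before, after).first(3).each do |type, delta|
  puts "#{type}: #{delta > 0 ? '+' : ''}#{delta}"
end
puts "Still holding #{retained.size} strings"
```

Type-level counts only narrow the search. Once you know that, say, strings or hashes are growing, a heap dump or an allocation tracer can show which code creates them.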
Modern applications are rarely one single process. A single web request might trigger a background job, which then calls an external service. When there’s a problem, how do you follow the chain of events? You use distributed tracing.
The idea is simple: when a request starts, you generate a unique trace ID. You then pass this ID to every other service or job that gets involved. Every step logs what it’s doing and attaches the same trace ID.
# Assumes Current is an ActiveSupport::CurrentAttributes subclass
# with a trace_id attribute

# At the start of a request
trace_id = SecureRandom.uuid
Current.trace_id = trace_id

# When enqueuing a background job
MyJob.perform_later(args, trace_id: trace_id)

# Inside the job
def perform(args, trace_id:)
  Current.trace_id = trace_id
  logger.info("Job started with trace #{trace_id}")
  # ... do work ...
end
Later, I can collect all the logs from the web server, the job worker, and any external service calls that used that trace ID. I can reconstruct the entire life of that single user request, even as it bounced between different parts of the system. It turns a tangled mess of logs into a single, clear timeline.
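The same idea extends to HTTP calls between services. Here is a plain-Ruby sketch of both directions: reusing a trace ID that arrived with the request, and attaching it to an outgoing call. The X-Trace-Id header name, the helper names, and the validation pattern are assumptions for illustration. Real systems often follow a shared convention such as the W3C traceparent header instead.

```ruby
require "net/http"
require "securerandom"
require "uri"

TRACE_HEADER = "X-Trace-Id" # illustrative header name

# Reuses a well-formed inbound trace ID, otherwise starts a new trace.
# Rack exposes the X-Trace-Id header as HTTP_X_TRACE_ID in the env hash.
def resolve_trace_id(rack_env)
  inbound = rack_env["HTTP_X_TRACE_ID"].to_s
  inbound.match?(/\A[\w-]{8,64}\z/) ? inbound : SecureRandom.uuid
end

# Builds (but does not send) a GET request that carries the trace ID.
def traced_get_request(url, trace_id)
  request = Net::HTTP::Get.new(URI(url))
  request[TRACE_HEADER] = trace_id
  request
end

trace_id = resolve_trace_id({ "HTTP_X_TRACE_ID" => "abc12345-from-upstream" })
request = traced_get_request("https://example.com/api/orders", trace_id)
puts request[TRACE_HEADER]
```

Validating the inbound value matters because it ends up in your logs. Accepting arbitrary header content there would let a caller inject misleading log lines.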
Putting all this together changes how you handle problems in production. You move from reactive panic to calm investigation. You have the tools to ask the right questions and get clear answers. You start with logs to get the story, use exception tracking for crash snapshots, and hit debug endpoints for system health.
For deeper issues, you might run a safe code snippet, profile a slow request, compare memory snapshots, or follow a trace across services. Each pattern gives you a different lens to look at the problem. The best approach is to have them all ready, so you can choose the right tool for the job.
Building these patterns into your application takes effort upfront. But the first time you solve a tricky production bug in minutes instead of hours, you’ll see the value. It turns debugging from a stressful mystery into a systematic process of discovery.