Let me walk you through how I keep my Ruby on Rails applications running smoothly in production. When your app is live and real users depend on it, you need ways to see what’s happening inside. It’s like having windows into a machine that’s constantly working.
I think of production monitoring as installing sensors throughout your application. These sensors collect information about how everything is performing. Observability goes further—it’s about asking questions of your system and getting useful answers when something unusual happens.
Here’s how I approach this, broken down into practical strategies you can implement.
Structured Logging with Context
Logs are your first line of defense. But default Rails logs often leave you guessing what happened. I add structure and context to every log entry.
# I wrap each request with unique identifiers
class ApplicationController < ActionController::Base
  around_action :wrap_with_request_context

  private

  def wrap_with_request_context
    request_id = SecureRandom.uuid
    user_id = current_user&.id if respond_to?(:current_user)
    # Store these in a thread-safe location
    Current.request_id = request_id
    Current.user_id = user_id
    # Tag all logs from this request
    Rails.logger.tagged("REQUEST-#{request_id}") do
      log_request_start
      yield
    ensure
      log_request_end
      Current.clear_all
    end
  end

  def log_request_start
    Rails.logger.info(
      action: "#{controller_name}##{action_name}",
      method: request.method,
      path: request.path,
      params: filtered_params,
      ip: request.remote_ip,
      user_id: Current.user_id
    )
  end

  def log_request_end
    Rails.logger.info(
      action: "#{controller_name}##{action_name}",
      status: response.status
    )
  end

  def filtered_params
    # Never log sensitive information; to_unsafe_h reads the raw hash
    # without mutating params the way permit! does
    params.to_unsafe_h.except('password', 'credit_card', 'token')
  end
end
This approach gives me searchable, structured logs. When a user reports an issue, I can find all logs from their request using that unique ID. I can see the entire journey through my application.
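For the logs to be truly searchable, each entry should land as a single JSON object. Here is a minimal formatter sketch in plain Ruby; the `context` lambda stands in for reading `Current` attributes in a real Rails app, and `JsonLogFormatter` is my own name, not a Rails API:

```ruby
require 'json'
require 'logger'
require 'time'

# A formatter that emits one JSON object per log line, merging in whatever
# request context is available at the time of the call.
class JsonLogFormatter < Logger::Formatter
  def initialize(context: -> { {} })
    @context = context
  end

  def call(severity, time, _progname, msg)
    entry = {
      level: severity,
      time: time.utc.iso8601(3),
      message: msg.is_a?(Hash) ? nil : msg.to_s
    }.compact
    entry.merge!(msg) if msg.is_a?(Hash)   # structured payloads merge in directly
    entry.merge!(@context.call)            # request_id, user_id, etc.
    entry.to_json + "\n"
  end
end

logger = Logger.new($stdout)
logger.formatter = JsonLogFormatter.new(context: -> { { request_id: 'abc-123' } })
logger.info(action: 'products#show', path: '/products/1')
```

With this in place, a log aggregator can filter on `request_id` directly instead of grepping free text.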
I also make sure background jobs carry this context forward:
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    # Pass request context from the web request that queued this job.
    # job.arguments is an Array, so the context travels as a trailing
    # metadata hash rather than something we can dig into directly.
    metadata = job.arguments.last.is_a?(Hash) ? job.arguments.last.fetch(:metadata, {}) : {}
    Current.request_id = metadata[:request_id]
    Current.user_id = metadata[:user_id]
    Rails.logger.tagged("JOB-#{job.job_id}") do
      block.call
    end
  end
end
Now when a job runs, I know which web request triggered it. This connection between web requests and background processing is crucial for debugging complex issues.
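The context has to be attached on the enqueuing side for this to work. One convention (my own, not an ActiveJob feature) is to append a trailing metadata hash to the job's arguments; two small helpers sketch both directions:

```ruby
# Builds a job argument list whose last element is a metadata hash carrying
# the web request's context. The worker peels it off before doing real work.
def with_request_context(args, request_id:, user_id: nil)
  args + [{ metadata: { request_id: request_id, user_id: user_id } }]
end

# Mirror of the above: pull the metadata back out of a job's arguments,
# tolerating jobs that were enqueued without it.
def extract_request_context(args)
  last = args.last
  last.is_a?(Hash) ? last.fetch(:metadata, {}) : {}
end

args = with_request_context(['order-55'], request_id: 'req-abc', user_id: 7)
ctx  = extract_request_context(args)
```

In a controller this would look like `OrderJob.perform_later(*with_request_context([order.id], request_id: Current.request_id, user_id: Current.user_id))`.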
Collecting Meaningful Metrics
Metrics give me numbers about my application’s health. I track things like response times, error rates, and business activities.
Here’s a simple metrics collector I might build:
# Thread-safe structures come from the concurrent-ruby gem
require 'concurrent'

class Metrics
  @@counters = Concurrent::Hash.new(0)
  @@timings = Concurrent::Array.new

  def self.statsd
    # Reuse one client rather than opening a socket per metric
    @statsd ||= Statsd.new(ENV['STATSD_HOST']) if ENV['STATSD_HOST']
  end

  def self.increment(name, value = 1)
    @@counters[name] += value
    # Also send to external service if configured; count takes an arbitrary
    # value, whereas increment's second argument is a sample rate
    statsd&.count(name, value)
  end

  def self.measure(name)
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    # Record the timing even when the block raises
    duration_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time) * 1000
    @@timings << { name: name, duration: duration_ms }
    # Keep only recent timings
    @@timings.shift if @@timings.size > 1000
  end

  def self.get_counters
    @@counters.dup
  end

  def self.get_timings(name)
    @@timings.select { |t| t[:name] == name }.map { |t| t[:duration] }
  end
end

# Then in my controllers:
class ProductsController < ApplicationController
  def show
    Metrics.measure('products.show') do
      @product = Product.find(params[:id])
      Metrics.increment('products.viewed')
      render :show
    end
  end
end
I track business metrics too—not just technical ones:
class CheckoutController < ApplicationController
  def create
    order = Order.new(order_params)
    if order.save
      Metrics.increment('orders.created')
      Metrics.increment('orders.value.total', order.total_cents)
      Metrics.increment("orders.user.#{current_user.tier}") if current_user
      redirect_to order_path(order)
    else
      Metrics.increment('orders.failed')
      render :new
    end
  end
end
These business metrics help me understand how users interact with my application. I can see if a new feature increases conversions or if there’s a problem in the checkout flow.
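Raw timings only become useful once they're summarized. A small nearest-rank percentile helper (a sketch; real deployments usually lean on StatsD or Prometheus for this) turns the `@@timings` samples into the p50/p95 numbers I watch:

```ruby
# Computes a percentile from a list of duration samples using the
# nearest-rank method. Returns nil for an empty list.
def percentile(samples, pct)
  return nil if samples.empty?
  sorted = samples.sort
  rank = (pct / 100.0 * sorted.size).ceil - 1
  sorted[[rank, 0].max]
end

durations = [12, 45, 7, 180, 33, 22, 95, 61, 14, 250] # milliseconds
p50 = percentile(durations, 50)
p95 = percentile(durations, 95)
```

The p95 is usually the number worth alerting on: medians hide the slow tail that users actually feel.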
Implementing Distributed Tracing
When a request touches multiple services—your Rails app, a background job, an external API—you need to follow that journey. Distributed tracing creates a breadcrumb trail.
class Trace
  TRACE_ID_HEADER = 'X-Trace-Id'
  SPAN_ID_HEADER = 'X-Span-Id'

  def self.start_span(name, parent_trace_id: nil, parent_span_id: nil)
    trace_id = parent_trace_id || generate_id
    span_id = generate_id
    # Wall-clock timestamps for display, monotonic clock for duration
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    span = {
      trace_id: trace_id,
      span_id: span_id,
      parent_span_id: parent_span_id,
      name: name,
      start: Time.now.utc.iso8601(6),
      pid: Process.pid,
      thread: Thread.current.object_id
    }
    # Store in current context
    Current.trace_id = trace_id
    Current.span_id = span_id
    begin
      result = yield(span)
      span[:status] = 'success'
      result
    rescue => e
      span[:status] = 'error'
      span[:error] = e.message
      raise
    ensure
      span[:end] = Time.now.utc.iso8601(6)
      span[:duration_ms] = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at) * 1000).round(3)
      record_span(span)
    end
  end

  def self.generate_id
    SecureRandom.hex(16)
  end

  def self.record_span(span)
    # Send to tracing backend or store locally
    Rails.logger.debug("TRACE: #{span.to_json}")
    # For a simple implementation, store in Redis
    # (Redis.current assumes redis-rb 4.x; it was removed in 5.x)
    if Redis.current
      key = "trace:#{span[:trace_id]}"
      Redis.current.lpush(key, span.to_json)
      Redis.current.expire(key, 3600) # Keep for 1 hour
    end
  end
end
# In a middleware to start traces for web requests:
class TracingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # The header constants live on Trace, so reference them with the namespace
    trace_id = env["HTTP_#{Trace::TRACE_ID_HEADER.tr('-', '_').upcase}"] || Trace.generate_id
    parent_span_id = env["HTTP_#{Trace::SPAN_ID_HEADER.tr('-', '_').upcase}"]
    Trace.start_span("request", parent_trace_id: trace_id, parent_span_id: parent_span_id) do |span|
      span[:method] = env['REQUEST_METHOD']
      span[:path] = env['PATH_INFO']
      # Continue the request
      status, headers, body = @app.call(env)
      # Pass trace headers downstream
      headers[Trace::TRACE_ID_HEADER] = trace_id
      headers[Trace::SPAN_ID_HEADER] = span[:span_id]
      [status, headers, body]
    end
  end
end
Now I can trace calls to external services:
class ExternalApiService
  def fetch_data
    Trace.start_span('external_api.fetch') do |span|
      span[:service] = 'payment_gateway'
      # Make HTTP request with trace headers
      headers = {
        'Content-Type' => 'application/json',
        Trace::TRACE_ID_HEADER => Current.trace_id,
        Trace::SPAN_ID_HEADER => Current.span_id
      }
      response = HTTParty.get('https://api.example.com/data', headers: headers)
      span[:response_status] = response.code
      response
    end
  end
end
When I need to debug a slow request, I can find the trace ID from logs and see exactly where time was spent across all services.
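Once the spans for a trace ID are pulled back out of Redis, they can be stitched into a waterfall by parent span. A sketch over hand-built span hashes (the field names match the `Trace` class above):

```ruby
# Reconstructs a simple waterfall view from recorded spans: children are
# grouped under their parent_span_id and ordered by start time, with
# indentation showing depth.
def trace_waterfall(spans)
  by_parent = spans.group_by { |s| s[:parent_span_id] }
  walk = lambda do |parent, depth|
    (by_parent[parent] || []).sort_by { |s| s[:start] }.flat_map do |span|
      ["#{'  ' * depth}#{span[:name]} (#{span[:duration_ms]}ms)"] +
        walk.call(span[:span_id], depth + 1)
    end
  end
  walk.call(nil, 0) # root spans have no parent
end

spans = [
  { span_id: 'a', parent_span_id: nil, name: 'request',            start: '1', duration_ms: 120 },
  { span_id: 'b', parent_span_id: 'a', name: 'db.query',           start: '2', duration_ms: 40 },
  { span_id: 'c', parent_span_id: 'a', name: 'external_api.fetch', start: '3', duration_ms: 60 }
]
trace_waterfall(spans)
```

Even this crude view answers the usual question quickly: which child span ate most of the request's 120ms.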
Comprehensive Error Tracking
Errors will happen. The goal is to catch them with enough context to fix them quickly.
class ErrorCapture
  def self.capture(exception, context = {})
    backtrace = exception.backtrace || []
    # Build error context
    error_data = {
      exception: exception.class.name,
      message: exception.message,
      backtrace: backtrace.first(10),
      timestamp: Time.now.utc.iso8601,
      context: base_context.merge(context)
    }
    # Send to external service (Sentry, Rollbar, etc.); pass the exception
    # object itself, not just its class name
    send_to_external_service(exception, error_data) if external_service_configured?
    # Also log locally with structured format
    Rails.logger.error("ERROR: #{error_data.to_json}")
    # Store in database for custom error dashboards
    ErrorRecord.create!(
      error_class: exception.class.name,
      message: exception.message.truncate(500),
      backtrace: backtrace.join("\n"),
      context: error_data[:context].to_json
    )
  end

  def self.base_context
    {
      request_id: Current.request_id,
      user_id: Current.user_id,
      params: Current.params,
      url: Current.url,
      # Don't include sensitive session data
      session_keys: Current.session&.keys || []
    }
  end

  def self.external_service_configured?
    defined?(Raven)
  end

  def self.send_to_external_service(exception, data)
    # Example: sending to Sentry via the sentry-raven gem
    # (newer Sentry SDKs use Sentry.capture_exception instead)
    Raven.capture_exception(
      exception,
      extra: data[:context],
      tags: { request_id: data[:context][:request_id] }
    ) if defined?(Raven)
  end
end

# In my application controller:
class ApplicationController < ActionController::Base
  rescue_from StandardError, with: :capture_error

  private

  def capture_error(exception)
    ErrorCapture.capture(exception, {
      controller: self.class.name,
      action: action_name,
      user_agent: request.user_agent,
      ip: request.remote_ip
    })
    # Re-raise for Rails' default error handling
    raise exception
  end
end
I also track errors in background jobs:
class ApplicationJob < ActiveJob::Base
  rescue_from StandardError do |exception|
    ErrorCapture.capture(exception, {
      job_class: self.class.name,
      arguments: job_arguments_safe,
      queue_name: queue_name,
      job_id: job_id
    })
    # Retry logic here
    retry_job(wait: 1.minute) if executions < 3
  end

  def job_arguments_safe
    arguments.map do |arg|
      if arg.respond_to?(:to_global_id)
        arg.to_global_id.to_s
      elsif arg.is_a?(Hash)
        arg.except('password', 'token')
      else
        arg
      end
    end
  end
end
This gives me complete error visibility. I can see which errors affect which users, how often they occur, and what the user was doing when the error happened.
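Those occurrence counts only make sense if repeated instances of the same error collapse into one bucket. A fingerprinting sketch (my own convention, not a library API): same exception class plus a normalized message plus the innermost application frame yields the same bucket.

```ruby
require 'digest'

# Produces a stable fingerprint for an exception so repeated occurrences
# group together. Numbers and hex ids in the message are masked first, so
# "id=123" and "id=999" land in the same bucket.
def error_fingerprint(error_class, message, backtrace)
  normalized = message.gsub(/\b0x[0-9a-f]+\b/i, 'ADDR').gsub(/\b\d+\b/, 'N')
  # Prefer the first frame inside the app over gem/framework frames
  app_frame = backtrace.find { |line| line.include?('/app/') } || backtrace.first
  Digest::SHA256.hexdigest([error_class, normalized, app_frame].join('|'))[0, 12]
end

bt = ['/app/models/order.rb:42', '/gems/activerecord/lib/base.rb:10']
error_fingerprint('ActiveRecord::RecordNotFound', "Couldn't find Order with id=123", bt)
```

Storing this fingerprint on each `ErrorRecord` makes "how many distinct errors do we have?" a single `GROUP BY`.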
Performance Monitoring and Anomaly Detection
Performance issues often creep in slowly. I monitor for gradual degradation and sudden changes.
class PerformanceMonitor
  # Track baseline performance for different operations
  BASELINES = {
    'db.query' => { p50: 50, p95: 200, p99: 500 }, # milliseconds
    'view.render' => { p50: 100, p95: 300, p99: 1000 },
    'api.call' => { p50: 200, p95: 1000, p99: 5000 }
  }.freeze

  def self.measure(operation, &block)
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = block.call
    duration_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time) * 1000
    record(operation, duration_ms)
    result
  end

  # Record a duration that was measured elsewhere (e.g. by Rails
  # instrumentation) and check it against the baseline
  def self.record(operation, duration_ms)
    record_measurement(operation, duration_ms)
    baseline = BASELINES[operation]
    if baseline && duration_ms > baseline[:p99]
      alert_slow_operation(operation, duration_ms, baseline[:p99])
    end
  end

  def self.record_measurement(operation, duration_ms)
    # Store in Redis with timestamp; the multi block yields a transaction
    # object that queued commands must be sent to
    key = "perf:#{operation}:#{Time.now.strftime('%Y%m%d%H%M')}"
    Redis.current.multi do |tx|
      tx.rpush(key, duration_ms)
      tx.expire(key, 86400) # Keep for 24 hours
    end
  end

  def self.alert_slow_operation(operation, actual, threshold)
    Rails.logger.warn(
      event: 'slow_operation',
      operation: operation,
      duration_ms: actual.round(2),
      threshold_ms: threshold,
      exceeded_by: "#{((actual - threshold) / threshold * 100).round(1)}%"
    )
    # Send alert if this happens repeatedly
    alert_key = "slow_op_alert:#{operation}"
    count = Redis.current.incr(alert_key)
    Redis.current.expire(alert_key, 300) # 5 minute window
    if count >= 3 # 3 slow operations in 5 minutes
      AlertManager.trigger(
        "Slow #{operation}",
        "#{operation} exceeded p99 threshold #{threshold}ms with #{actual.round(2)}ms",
        level: :warning
      )
    end
  end
end

# Use it to watch database queries. By the time this notification fires the
# query has already executed, so record the duration Rails measured rather
# than wrapping an empty block:
ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  # Skip schema queries
  next if event.payload[:name] == 'SCHEMA'
  PerformanceMonitor.record('db.query', event.duration)
end
I also monitor memory usage:
class MemoryMonitor
  def self.check
    # Get RSS memory usage (the ps flags work on Linux and macOS)
    rss_kb = `ps -o rss= -p #{Process.pid}`.to_i
    rss_mb = rss_kb / 1024.0
    # Get Ruby heap stats if available
    live_objects = total_objects = nil
    if GC.respond_to?(:stat)
      heap_stats = GC.stat
      live_objects = heap_stats[:heap_live_slots] || 0
      total_objects = heap_stats[:total_allocated_objects] || 0
    end
    {
      rss_mb: rss_mb.round(2),
      live_objects: live_objects,
      total_objects: total_objects,
      timestamp: Time.now.utc.iso8601
    }
  end

  def self.run_periodic_check
    Thread.new do
      loop do
        begin
          stats = check
          # Record for trending
          Redis.current.lpush('memory:history', stats.to_json)
          Redis.current.ltrim('memory:history', 0, 1440) # Keep 24 hours at 1-minute intervals
          # Alert on high memory
          if stats[:rss_mb] > 1024 # 1GB threshold
            AlertManager.trigger(
              'High Memory Usage',
              "Process using #{stats[:rss_mb]}MB of RAM",
              level: :warning
            )
          end
        rescue => e
          Rails.logger.error("Memory check failed: #{e.message}")
        end
        sleep 60 # Check every minute
      end
    end
  end
end

# Start monitoring when Rails boots
MemoryMonitor.run_periodic_check if Rails.env.production?
This continuous monitoring helps me spot trends. I can see if memory usage is slowly increasing over days, indicating a potential leak. I can detect when response times start creeping up before users complain.
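"Slowly increasing over days" can be made precise with a least-squares slope over the `memory:history` samples. A pure-Ruby sketch (the sample values are illustrative):

```ruby
# Least-squares slope of memory samples (MB) taken at fixed intervals.
# A persistently positive slope across many samples suggests a leak.
def memory_slope(samples)
  n = samples.size.to_f
  xs = (0...samples.size).to_a
  mean_x = xs.sum / n
  mean_y = samples.sum / n
  num = xs.zip(samples).sum { |x, y| (x - mean_x) * (y - mean_y) }
  den = xs.sum { |x| (x - mean_x)**2 }
  den.zero? ? 0.0 : num / den
end

growing = [500, 505, 510, 515, 520] # +5 MB per sample
memory_slope(growing)
```

Running this over the last few hours of samples and alerting when the slope stays above some threshold (say, a few MB per hour) catches leaks long before the hard 1GB limit trips.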
Health Checks and Dependency Monitoring
My application depends on external services—databases, caches, APIs. I need to know when these dependencies fail.
require 'timeout'

class HealthCheck
  CHECKS = [
    {
      name: 'database',
      critical: true,
      check: -> { ActiveRecord::Base.connection.execute('SELECT 1') },
      timeout: 5
    },
    {
      name: 'redis',
      critical: true,
      check: -> { Redis.current.ping == 'PONG' },
      timeout: 2
    },
    {
      name: 'elasticsearch',
      critical: false,
      check: -> { Elasticsearch::Model.client.ping },
      timeout: 3
    },
    {
      name: 'storage',
      critical: false,
      check: -> do
        # Check if we can write to storage
        test_file = Rails.root.join('tmp', 'healthcheck.txt')
        File.write(test_file, 'test')
        File.read(test_file) == 'test'
      ensure
        File.delete(test_file) rescue nil
      end,
      timeout: 5
    }
  ].freeze

  def self.run_all
    results = {}
    CHECKS.each do |check|
      results[check[:name]] = run_single_check(check)
    end
    build_summary(results)
  end

  def self.run_single_check(check)
    start_time = Time.now
    Timeout.timeout(check[:timeout]) do
      check[:check].call
      {
        status: 'healthy',
        duration_ms: ((Time.now - start_time) * 1000).round(2)
      }
    end
  rescue Timeout::Error
    {
      status: 'timeout',
      duration_ms: check[:timeout] * 1000
    }
  rescue => e
    {
      status: 'unhealthy',
      error: e.message,
      duration_ms: ((Time.now - start_time) * 1000).round(2)
    }
  end

  def self.build_summary(results)
    all_healthy = results.values.all? { |r| r[:status] == 'healthy' }
    critical_healthy = results.select do |name, _|
      CHECKS.find { |c| c[:name] == name }[:critical]
    end.values.all? { |r| r[:status] == 'healthy' }
    overall_status = if all_healthy
      'healthy'
    elsif critical_healthy
      'degraded'
    else
      'unhealthy'
    end
    {
      status: overall_status,
      timestamp: Time.now.utc.iso8601,
      checks: results,
      # Assumes an initializer sets config.uptime_start = Time.now at boot
      uptime: Rails.application.config.uptime_start ?
        (Time.now - Rails.application.config.uptime_start).round : nil
    }
  end
end
# Add a health endpoint
class HealthController < ApplicationController
  skip_before_action :authenticate_user
  skip_before_action :verify_authenticity_token

  def show
    health_data = HealthCheck.run_all
    render json: health_data, status: health_status(health_data[:status])
  end

  private

  def health_status(status)
    case status
    when 'healthy' then :ok
    when 'degraded' then :service_unavailable
    when 'unhealthy' then :service_unavailable
    end
  end
end
I also set up external monitoring to hit this endpoint:
# In config/routes.rb
get '/health', to: 'health#show'
get '/health/detailed', to: 'health#detailed' # More detailed version
Load balancers and orchestration systems can use these endpoints to determine if my application is ready to receive traffic.
Smart Alerting and Notification
Alerts should tell me what’s wrong and what to do about it. Too many alerts cause alert fatigue, where important warnings get ignored.
class AlertManager
  # Define alert levels and who to notify
  LEVELS = {
    critical: {
      channels: [:pagerduty, :slack_ops, :email_admin],
      cooldown: 5.minutes,
      repeat_every: 30.minutes
    },
    warning: {
      channels: [:slack_ops, :email_team],
      cooldown: 15.minutes,
      repeat_every: 2.hours
    },
    info: {
      channels: [:slack_general],
      cooldown: 1.hour
    }
  }.freeze

  def self.trigger(alert_key, message, level: :warning, details: {})
    # Check if we're in cooldown for this alert
    # (exists? returns a boolean; plain exists returns a count in redis-rb 4.2+)
    cooldown_key = "alert_cooldown:#{alert_key}:#{level}"
    return if Redis.current.exists?(cooldown_key)
    # Build alert payload
    alert = {
      id: SecureRandom.uuid,
      key: alert_key,
      level: level,
      message: message,
      details: details,
      timestamp: Time.now.utc.iso8601,
      environment: Rails.env,
      host: Socket.gethostname
    }
    # Store alert
    store_alert(alert)
    # Send notifications
    LEVELS[level][:channels].each do |channel|
      send_to_channel(channel, alert)
    end
    # Set cooldown
    Redis.current.setex(
      cooldown_key,
      LEVELS[level][:cooldown].to_i,
      '1'
    )
    # Schedule repeat if needed
    if LEVELS[level][:repeat_every]
      schedule_repeat(alert_key, message, level, details, LEVELS[level][:repeat_every])
    end
    alert
  end

  def self.store_alert(alert)
    # Store in Redis for recent alerts
    Redis.current.lpush('alerts:recent', alert.to_json)
    Redis.current.ltrim('alerts:recent', 0, 99) # Keep 100 most recent
    # Also store in database for long-term retention
    Alert.create!(
      alert_id: alert[:id],
      key: alert[:key],
      level: alert[:level],
      message: alert[:message],
      details: alert[:details],
      environment: alert[:environment]
    )
  end

  def self.send_to_channel(channel, alert)
    case channel
    when :slack_ops
      send_slack_message(
        ENV['SLACK_OPS_WEBHOOK'],
        format_for_slack(alert)
      )
    when :pagerduty
      send_pagerduty_alert(alert)
    when :email_admin
      AdminMailer.alert(alert).deliver_later
    end
  end

  def self.format_for_slack(alert)
    color = case alert[:level]
            when :critical then '#ff0000'
            when :warning then '#ffcc00'
            when :info then '#00ccff'
            end
    {
      attachments: [{
        color: color,
        title: "#{alert[:level].to_s.upcase}: #{alert[:message]}",
        text: "Environment: #{alert[:environment]}\nHost: #{alert[:host]}",
        fields: alert[:details].map { |k, v| { title: k.to_s, value: v.to_s, short: true } },
        timestamp: alert[:timestamp]
      }]
    }
  end

  def self.schedule_repeat(alert_key, message, level, details, interval)
    # Schedule a repeat check
    AlertRepeatJob.set(wait: interval).perform_later(
      alert_key, message, level, details
    )
  end
end
# Job to check if alert is still relevant
class AlertRepeatJob < ApplicationJob
  def perform(alert_key, original_message, level, original_details)
    # Check if condition still exists
    condition_still_exists = check_condition(alert_key)
    if condition_still_exists
      # Update the message to indicate it's ongoing; the interval isn't
      # among the job arguments, so don't pretend to know the duration
      ongoing_message = "#{original_message} (still ongoing)"
      AlertManager.trigger(
        alert_key,
        ongoing_message,
        level: level,
        details: original_details.merge(repeated: true)
      )
    end
  end

  def check_condition(alert_key)
    # This would check the specific condition.
    # For example, for a high error rate alert:
    if alert_key.start_with?('error_rate.')
      # Calculate current error rate (calculate_current_error_rate is
      # app-specific and left undefined here)
      error_rate = calculate_current_error_rate
      threshold = alert_key.split('.').last.to_f
      error_rate > threshold
    else
      # Default: assume condition still exists
      true
    end
  end
end
end
I set up specific alert conditions:
# Monitor error rates
Thread.new do
  loop do
    begin
      # Calculate error rate for last 5 minutes
      five_min_ago = 5.minutes.ago
      error_count = RequestLog
        .where('created_at > ?', five_min_ago)
        .where(status: 500..599)
        .count
      total_count = RequestLog
        .where('created_at > ?', five_min_ago)
        .count
      error_rate = total_count > 0 ? error_count.to_f / total_count : 0
      if error_rate > 0.05 # 5% error rate
        AlertManager.trigger(
          'error_rate.high',
          "High error rate detected: #{(error_rate * 100).round(1)}%",
          level: :critical,
          details: {
            error_rate: error_rate,
            error_count: error_count,
            total_requests: total_count,
            period_minutes: 5
          }
        )
      end
    rescue => e
      Rails.logger.error("Error rate monitoring failed: #{e.message}")
    end
    sleep 60 # Check every minute
  end
end
This alerting system ensures I know about problems quickly, but I’m not bombarded with repeated notifications for the same issue.
Putting It All Together
Implementing these strategies gives me confidence that I’ll know about problems before users do. I can see performance trends, catch errors quickly, and understand the complete journey of any request through my system.
The key is starting simple and adding complexity as needed. Begin with structured logging and error tracking. Add metrics and health checks as your application grows. Implement distributed tracing when you add background jobs or external service calls.
Remember that monitoring is not a one-time setup. As your application evolves, your monitoring needs will change. Review what you’re tracking regularly. Remove metrics that aren’t useful. Add tracking for new features and services.
Most importantly, make sure someone is watching the alerts and knows how to respond. The best monitoring system in the world is useless if nobody acts on what it tells you.
These strategies have helped me maintain reliable Rails applications that scale with user growth. They provide the visibility I need to understand production behavior and the tools to quickly resolve issues when they occur.