How to Build Complete Observability in Java Applications: Logs, Metrics, Traces and Monitoring
Discover how to implement complete observability in Java applications with structured logging, metrics, tracing & alerts to transform debugging chaos into clear insights.
I remember staring at a wall of text logs at 3 a.m., trying to figure out why our application was slow. There were thousands of lines, all shouting different errors and warnings. I felt lost. That was when I realized that logs alone are not enough to truly understand a complex, modern Java application. You need a complete picture. You need to see the numbers, follow the journey of a request, and connect all the dots. This is what we call observability.
Let’s talk about how to build that picture, piece by piece.
The first step is to fix your logs. Plain text logs are a dead end when you’re dealing with thousands of requests per second. You need structure. Instead of a line that says ERROR: Failed to process order for user 123, you need something a machine can easily read and query. Think of it like switching from a handwritten diary to a well-organized spreadsheet.
This is where structured logging comes in. You format your log entries as JSON. Every piece of information becomes a labeled field. Here’s how you can set it up using a common logging framework.
<!-- Configuration snippet for logback.xml. LogstashEncoder comes from the logstash-logback-encoder library. -->
<appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="net.logstash.logback.encoder.LogstashEncoder">
    <!-- We format the timestamp in a standard way. The literal 'Z' is only correct when the time zone is UTC. -->
    <timeZone>UTC</timeZone>
    <timestampPattern>yyyy-MM-dd'T'HH:mm:ss.SSS'Z'</timestampPattern>
    <!-- We include specific context (MDC) fields we set in our code, one element per key -->
    <includeMdcKeyName>traceId</includeMdcKeyName>
    <includeMdcKeyName>userId</includeMdcKeyName>
    <includeMdcKeyName>sessionId</includeMdcKeyName>
    <!-- We can add constant fields to every log line -->
    <customFields>{"appname":"order-service","version":"2.1.0"}</customFields>
  </encoder>
</appender>
<root level="info">
  <appender-ref ref="JSON_CONSOLE" />
</root>
Now, when your application logs an error, it doesn’t just print a line. It outputs a complete JSON object.
{
  "@timestamp": "2023-10-27T03:45:12.123Z",
  "level": "ERROR",
  "logger_name": "com.example.OrderService",
  "message": "Failed to process order",
  "traceId": "abc123def456",
  "userId": "789",
  "appname": "order-service",
  "version": "2.1.0",
  "stack_trace": "..."
}
Suddenly, in your log management tool, you can run a query like traceId:"abc123def456" and see every single log entry—info, debug, error—from that specific request, across all the different parts of your code. It changes the game.
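To make that query concrete, here is a minimal sketch in plain Java of why labeled fields matter: once each log entry is a set of key-value pairs, "find everything for this trace" is a simple filter rather than a text search. The class name StructuredLogQuery and its helpers are hypothetical and only for illustration; a real log management tool does this at scale with an index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Logback or any log tool.
final class StructuredLogQuery {

    // Returns every event whose given field equals the expected value, in original order.
    static List<Map<String, String>> byField(List<Map<String, String>> events,
                                             String field, String expected) {
        return events.stream()
                .filter(e -> expected.equals(e.get(field)))
                .collect(Collectors.toList());
    }

    // Builds one log event as an ordered map of labeled fields.
    static Map<String, String> event(String level, String message, String traceId) {
        Map<String, String> e = new LinkedHashMap<>();
        e.put("level", level);
        e.put("message", message);
        e.put("traceId", traceId);
        return e;
    }

    public static void main(String[] args) {
        List<Map<String, String>> events = List.of(
                event("INFO", "Order received", "abc123def456"),
                event("INFO", "Health check", "zzz999"),
                event("ERROR", "Failed to process order", "abc123def456"));

        // Prints only the two events that belong to trace abc123def456.
        byField(events, "traceId", "abc123def456")
                .forEach(e -> System.out.println(e.get("level") + " " + e.get("message")));
    }
}
```

With plain text lines, the same lookup would depend on every developer having written the trace ID into the message in exactly the same way.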
But logs tell you what happened. To understand how your system is behaving as a whole, you need numbers. You need metrics. How many requests are we getting per second? What’s the average response time? How many are failing? This is the pulse of your application.
For this, I use Micrometer. It’s a toolkit that lets you record metrics in a standard way, without locking you into a specific monitoring system.
Imagine you want to track how many times a specific API endpoint is called and how long it takes.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class UserController {
    // Micrometer provides a registry where all metrics are stored
    private final MeterRegistry meterRegistry;
    private final UserService userService; // Your own service bean
    private final Counter requestCounter;
    private final Timer requestTimer;

    public UserController(MeterRegistry meterRegistry, UserService userService) {
        this.meterRegistry = meterRegistry;
        this.userService = userService;
        // Build a counter for HTTP requests to this endpoint
        this.requestCounter = Counter.builder("api.calls")
                .tag("endpoint", "/api/users")
                .tag("method", "GET")
                .description("Total number of calls to GET /api/users")
                .register(meterRegistry);
        // Register the timer once instead of rebuilding it on every request
        this.requestTimer = Timer.builder("api.duration")
                .tag("endpoint", "/api/users")
                .tag("method", "GET")
                .description("Duration of API calls")
                .publishPercentileHistogram() // Enables detailed latency analysis
                .register(meterRegistry);
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        // Start timing the request
        Timer.Sample sample = Timer.start(meterRegistry);
        requestCounter.increment(); // Count the call
        try {
            return userService.findAll(); // Your business logic
        } finally {
            // Stop the timer and record the duration, even if the call failed
            sample.stop(requestTimer);
        }
    }
}
Now you have a counter ticking up with every call and a timer capturing how long each call takes. A counter only ever goes up, so what you watch is its rate: you can see if the call rate suddenly drops to zero (maybe the service is down) or if the duration slowly creeps up (maybe the database is getting slower).
Logs and metrics are powerful, but in a world where a single user request might touch five different microservices, you get lost again. Which service caused the delay? Did the payment service call the fraud service, or did it time out? To answer this, you need to follow the request. You need distributed tracing.
This is where OpenTelemetry shines. It helps you create a “trace” — a recorded journey of a request through your system. Each unit of work in that journey is a “span.”
Let me show you how to instrument a simple service.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {
    // Get a Tracer instance for your application
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("order-service");

    @PostMapping("/order")
    public ResponseEntity<String> createOrder(@RequestBody Order order) {
        // Start a new span for this API handler. This is a SERVER span because it's receiving a request.
        Span span = tracer.spanBuilder("POST /order")
                .setSpanKind(SpanKind.SERVER)
                .startSpan();
        // Make this span the active span for the current block of code.
        try (Scope scope = span.makeCurrent()) {
            // Add useful attributes to the span for later filtering
            span.setAttribute("user.id", order.getUserId());
            span.setAttribute("order.amount", order.getTotalAmount());
            // Your business logic here. Let's say it has two main parts.
            processInventory(order); // This will create its own child span
            chargePayment(order);    // This will create another child span (method not shown)
            span.addEvent("Order processed successfully");
            return ResponseEntity.ok("Order placed");
        } catch (Exception e) {
            // Record the error on the span
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Order processing failed");
            return ResponseEntity.status(500).body("Error");
        } finally {
            // The span must be ended.
            span.end();
        }
    }

    private void processInventory(Order order) {
        // Start a child span for this internal operation.
        // It picks up the current span as its parent automatically.
        Span childSpan = tracer.spanBuilder("processInventory")
                .startSpan();
        try (Scope childScope = childSpan.makeCurrent()) {
            // ... inventory logic ...
            Thread.sleep(50); // Simulating some work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            childSpan.end();
        }
    }
}
When this runs, a tracing tool will show you a waterfall diagram. You’ll see the parent POST /order span, and inside it, the processInventory child span, perfectly timed. If the chargePayment method calls another microservice through an instrumented HTTP client (for example, with the OpenTelemetry Java agent attached), the trace context is propagated in the HTTP headers, linking the spans across service boundaries. You can literally see the request flow from one service to the next.
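To show what travels in those headers, here is a small sketch of the W3C Trace Context traceparent header, which OpenTelemetry uses by default. Its version-00 form is four dash-separated hex fields: version, a 32-character trace ID, a 16-character parent span ID, and 2-character flags. The class TraceparentSketch is hypothetical and simplified (it does not reject all-zero IDs, for instance); in a real service the OpenTelemetry propagators handle this for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: real services should rely on OpenTelemetry's propagators.
final class TraceparentSketch {

    // version "00", 32 hex trace-id, 16 hex parent span-id, 2 hex trace-flags
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    // Extracts the trace ID, which stays the same across every service in the request.
    static String traceIdOf(String header) {
        return match(header).group(1);
    }

    // Builds the header a service would send downstream: same trace ID and flags,
    // but the parent span ID is replaced by the caller's own span ID.
    static String childHeader(String incoming, String newSpanId) {
        Matcher m = match(incoming);
        if (newSpanId == null || !newSpanId.matches("[0-9a-f]{16}")) {
            throw new IllegalArgumentException("Span ID must be 16 lowercase hex characters");
        }
        return "00-" + m.group(1) + "-" + newSpanId + "-" + m.group(3);
    }

    private static Matcher match(String header) {
        Matcher m = FORMAT.matcher(header == null ? "" : header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent header: " + header);
        }
        return m;
    }

    public static void main(String[] args) {
        String incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
        System.out.println(traceIdOf(incoming));
        System.out.println(childHeader(incoming, "b7ad6b7169203331"));
    }
}
```

The trace ID is the part every service shares, which is why it is the natural key for linking spans and, as the next step shows, logs.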
Now, here’s the magic trick to tie logs and traces together. You need to put that unique trace ID into every single log line that is part of that request. This is done using the Mapped Diagnostic Context, or MDC.
Think of MDC as a per-thread key-value store that lasts for the duration of a request. You put the trace ID in at the start, and every log statement automatically includes it.
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.UUID;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class TraceIdFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain) throws IOException, ServletException {
        // Try to get the trace ID from the incoming request header.
        // OpenTelemetry's default 'traceparent' header has the form
        // version-traceId-parentSpanId-flags, so only the second field is the trace ID.
        String traceId = null;
        String traceparent = request.getHeader("traceparent");
        if (traceparent != null) {
            String[] parts = traceparent.split("-");
            if (parts.length == 4 && parts[1].length() == 32) {
                traceId = parts[1];
            }
        }
        if (traceId == null) {
            // If it's a new request, generate a unique 32-character ID.
            traceId = UUID.randomUUID().toString().replace("-", "");
        }
        // Put it into the MDC. The key "traceId" is what our JSON log encoder is looking for.
        MDC.put("traceId", traceId);
        // Also echo it in the response headers so the caller can quote it in a bug report.
        response.setHeader("X-Trace-ID", traceId);
        try {
            // Continue processing the request.
            filterChain.doFilter(request, response);
        } finally {
            // Crucial: Clear the MDC after the request is done.
            // This prevents trace IDs from leaking into other requests on the same thread.
            MDC.clear();
        }
    }
}
With this filter in place, every log statement made on the request thread, from controllers, services, and even third-party libraries that use SLF4J, will have the traceId field in its JSON output. Because MDC is per-thread, work handed off to another thread or executor needs the context copied across explicitly. When you get an alert, you find the trace ID, and with one search you have every related log line, which you can line up with the trace diagram showing the flow and the metrics for that endpoint.
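If the "per-thread key-value store" idea feels abstract, the following toy stand-in may help. It is not SLF4J's MDC, just a hypothetical MiniMdc class built on ThreadLocal, and it runs two simulated requests back to back on one pooled thread to show why clearing the context at the end of a request matters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A toy stand-in for MDC, to show per-thread storage only.
final class MiniMdc {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }
    static String get(String key) { return CONTEXT.get().get(key); }
    static void clear() { CONTEXT.get().clear(); }

    // Runs two "requests" one after another on a single pooled thread and
    // returns what the second request sees under the key "traceId".
    static String seenBySecondRequest(boolean clearAfterFirst) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                put("traceId", "request-1");
                if (clearAfterFirst) {
                    clear(); // what the filter's finally block does
                }
            };
            Callable<String> secondRequest = () -> get("traceId");
            pool.submit(firstRequest).get();
            return pool.submit(secondRequest).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + seenBySecondRequest(false)); // stale ID leaks
        System.out.println("with clear:    " + seenBySecondRequest(true));  // null
    }
}
```

Server threads are pooled in much the same way, so a missing clear means the next request on that thread would log under the previous request's trace ID.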
Collecting metrics is pointless if you can’t store and query them. One of the most popular systems for this is Prometheus. It works on a “pull” model—it periodically scrapes a metrics endpoint from your application. Setting this up in a Spring Boot application is straightforward.
You don’t need to write the endpoint yourself. Spring Boot Actuator with Micrometer does it for you once the micrometer-registry-prometheus dependency is on the classpath.
# In your application.yml file
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus # Expose the health, metrics, and prometheus endpoints
  prometheus:
    metrics:
      export:
        enabled: true # Spring Boot 3.x key; on 2.x use management.metrics.export.prometheus.enabled
Once your app is running, hit http://your-app:8080/actuator/prometheus. You’ll see plain-text output that looks like this:
# HELP api_calls_total Total number of calls to GET /api/users
# TYPE api_calls_total counter
api_calls_total{endpoint="/api/users",method="GET",} 1427.0
# HELP api_duration_seconds Duration of API calls
# TYPE api_duration_seconds histogram
api_duration_seconds_bucket{endpoint="/api/users",method="GET",le="0.005",} 100.0
api_duration_seconds_bucket{endpoint="/api/users",method="GET",le="0.01",} 300.0
api_duration_seconds_bucket{endpoint="/api/users",method="GET",le="0.025",} 1200.0
api_duration_seconds_sum{endpoint="/api/users",method="GET",} 45.67
api_duration_seconds_count{endpoint="/api/users",method="GET",} 1427.0
Prometheus scrapes this page at a configured interval (15 to 30 seconds is a common choice), stores the numbers, and lets you query them with a powerful language called PromQL. You can write queries like rate(api_calls_total[5m]) to get the average calls per second over the last 5 minutes, or histogram_quantile(0.95, sum by (le) (rate(api_duration_seconds_bucket[5m]))) to estimate the 95th percentile latency.
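It helps to know that histogram_quantile gives an estimate, not an exact value. The sketch below is my simplified reading of the idea: find the bucket that contains the requested rank, then interpolate linearly inside it. HistogramQuantileSketch is a hypothetical class, and the +Inf bucket count of 1427 is an assumption taken from the _count line in the sample output (the +Inf bucket always equals the total count).

```java
// Simplified illustration of estimating a quantile from cumulative histogram buckets.
final class HistogramQuantileSketch {

    // upperBounds: ascending bucket upper bounds ("le"), the last one being +Infinity.
    // cumulativeCounts: number of observations <= each bound, same length as upperBounds.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || cumulativeCounts.length != n) {
            throw new IllegalArgumentException("Need at least two buckets of matching length");
        }
        if (q <= 0.0 || q > 1.0) {
            throw new IllegalArgumentException("q must be in (0, 1]");
        }
        double total = cumulativeCounts[n - 1];
        if (total <= 0.0) {
            throw new IllegalArgumentException("Histogram has no observations");
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < n - 1 && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == n - 1) {
            // The rank falls in the +Inf bucket: the best available answer
            // is the highest finite bound.
            return upperBounds[n - 2];
        }
        double lowerBound = (i == 0) ? 0.0 : upperBounds[i - 1];
        double countBelow = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - countBelow;
        // Assume observations are spread evenly inside the bucket.
        return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / inBucket;
    }

    public static void main(String[] args) {
        double[] bounds = {0.005, 0.01, 0.025, Double.POSITIVE_INFINITY};
        double[] counts = {100, 300, 1200, 1427};
        System.out.println("p50 estimate: " + quantile(0.50, bounds, counts));
        System.out.println("p95 estimate: " + quantile(0.95, bounds, counts));
    }
}
```

With these three finite buckets, the p95 lands in the +Inf bucket and the estimate is capped at 0.025 seconds. That is a hint that the bucket boundaries need to extend past the latencies you actually care about.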
Knowing your app is up is one thing. Knowing it’s truly healthy is another. A health endpoint that just returns {"status": "UP"} is not very useful. What if your app is running but its connection to the database is broken? It’s not healthy. You need to check your dependencies.
Spring Boot Actuator gives you a basic health endpoint, but you can easily extend it.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

@Component
public class CustomDatabaseHealthIndicator implements HealthIndicator {
    private final DataSource dataSource;

    public CustomDatabaseHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        // Try to get a connection and validate it
        try (Connection connection = dataSource.getConnection()) {
            if (connection.isValid(2)) { // 2 second timeout
                // You can add detailed information
                return Health.up()
                        .withDetail("database", "primary")
                        .withDetail("validationQuery", "isValid() succeeded")
                        .build();
            } else {
                return Health.down().withDetail("database", "connection invalid").build();
            }
        } catch (SQLException e) {
            return Health.down(e) // This will include the exception message
                    .withDetail("database", "primary")
                    .build();
        }
    }
}
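Now your /actuator/health endpoint might return the following. The component is named customDatabase because Spring Boot derives the name from the bean name minus the HealthIndicator suffix, and the details only appear when management.endpoint.health.show-details allows it.
{
  "status": "UP",
  "components": {
    "customDatabase": {
      "status": "UP",
      "details": {
        "database": "primary",
        "validationQuery": "isValid() succeeded"
      }
    },
    "diskSpace": { "status": "UP", "details": { ... } },
    "ping": { "status": "UP" }
  }
}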
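In Kubernetes, you can point the liveness probe to /actuator/health/liveness and the readiness probe to /actuator/health/readiness. Spring Boot enables these probe endpoints automatically when it detects Kubernetes; elsewhere, set management.endpoint.health.probes.enabled=true. By default the readiness group does not include your database check, so add it explicitly, for example with management.endpoint.health.group.readiness.include=readinessState,customDatabase. With that in place, if the database goes down, the readiness probe fails and Kubernetes stops sending new traffic to that pod but leaves it running (hoping the database comes back). If the application itself is deadlocked, the liveness probe fails, and Kubernetes kills and restarts the pod. This is fundamental for resilient systems.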
Sometimes, you need to go deeper than metrics and traces. You need to know which specific method is using all the CPU, or what is creating millions of tiny objects that trigger garbage collection. For this, you need a profiler. And for Java, the best tool for production is Java Flight Recorder (JFR). The best part? Its overhead is so low you can run it all the time.
You don’t need to modify your code. You just enable it when you start your application.
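# Running your jar with continuous recording.
# The recording options form one comma-separated value with no spaces,
# so they are explained here rather than inline:
#   disk=true                 write to disk
#   maxsize=1G                keep at most 1 GB of recording data
#   maxage=24h                keep data for 24 hours
#   name=ContinuousRecording  name of the recording
#   path-to-gc-roots=true     useful for memory leak analysis (can add overhead)
java \
  -XX:StartFlightRecording=disk=true,maxsize=1G,maxage=24h,name=ContinuousRecording,path-to-gc-roots=true \
  -jar your-application.jar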
This runs in the background, collecting a huge amount of detailed profiling data with minimal impact—often quoted as less than 1% overhead. When a performance issue occurs, you have a recording that covers the exact time period. You can dump it to a file.
# Find the Java process ID (PID)
jcmd
# Dump the last 5 minutes of the continuous recording to a file
# (JFR.dump selects the time window with maxage, or with begin and end)
jcmd <PID> JFR.dump name=ContinuousRecording filename=./incident.jfr maxage=5m
Then, you open this incident.jfr file with JDK Mission Control. You can see a flame graph of CPU usage, a timeline of garbage collections, a list of the most allocated object types, and even which threads were blocked on locks. It turns “the application is slow” into “the processReport() method is spending 40% of its time in the generatePDF() method, which is doing excessive string concatenation.” That’s actionable.
For certain actions, especially around security and compliance, you need a special kind of log. An audit log. This is an immutable record of who did what and when. It’s separate from your debug logs and needs to be handled with more care.
You want to ensure these logs are always written, cannot be tampered with, and go to a secure location. Here’s a way to implement it using Spring’s AOP (Aspect-Oriented Programming).
First, define a custom annotation to mark methods that need auditing.
import java.lang.annotation.*;

@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface Audited {
    String action(); // e.g., "USER_LOGIN", "DATA_EXPORT"
}
Now, create an aspect that intercepts calls to methods with this annotation.
import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.security.core.Authentication;
import org.springframework.security.core.context.SecurityContextHolder;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class AuditLogAspect {
    // Use a dedicated logger for audit events. It can be configured to write to a separate file or system.
    private static final Logger AUDIT_LOG = LoggerFactory.getLogger("AUDIT_LOGGER");

    @AfterReturning("@annotation(audited)")
    public void auditSuccess(JoinPoint joinPoint, Audited audited) {
        // Log a structured success event. Keep it factual.
        // Raw arguments are deliberately not logged: they may contain secrets
        // such as the new password passed to changePassword().
        AUDIT_LOG.info("type=AUDIT, status=SUCCESS, user={}, action={}, method={}",
                currentUsername(), audited.action(), joinPoint.getSignature().toShortString());
    }

    @AfterThrowing(pointcut = "@annotation(audited)", throwing = "error")
    public void auditFailure(JoinPoint joinPoint, Audited audited, Throwable error) {
        // Log the failure with the reason
        AUDIT_LOG.warn("type=AUDIT, status=FAILURE, user={}, action={}, method={}, reason={}",
                currentUsername(), audited.action(),
                joinPoint.getSignature().toShortString(), error.getMessage());
    }

    // Get the current authenticated user, or "anonymous" if there is no auth context.
    private static String currentUsername() {
        Authentication authentication = SecurityContextHolder.getContext().getAuthentication();
        return (authentication != null) ? authentication.getName() : "anonymous";
    }
}
Now, you can simply annotate any sensitive method.
@Service
public class UserService {
    @Audited(action = "USER_PASSWORD_CHANGE")
    public void changePassword(String userId, String newPassword) {
        // ... business logic to change password ...
    }
}
The audit log will contain clear, searchable records of every password change attempt, who made it, and whether it succeeded. You should configure your logging framework to send logs from the AUDIT_LOGGER directly to a secure, append-only data store.
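One caution before method arguments go anywhere near an audit log: a method like changePassword receives the new password as a parameter, and that value must never be written out. If you do want parameters in the audit record, a redaction step is worth considering. The sketch below is one simple approach under stated assumptions: AuditParameterRedactor and its list of sensitive name fragments are hypothetical, and matching on parameter names is a heuristic, not a guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: masks argument values whose parameter names look sensitive.
final class AuditParameterRedactor {

    // Name fragments treated as sensitive; an illustrative list, extend it for your domain.
    private static final Set<String> SENSITIVE_HINTS = Set.of("password", "secret", "token");

    // Pairs each parameter name with its value, replacing sensitive values with a marker.
    static Map<String, String> redact(String[] names, Object[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("Names and values must have the same length");
        }
        Map<String, String> safe = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            String lower = names[i].toLowerCase(Locale.ROOT);
            boolean sensitive = SENSITIVE_HINTS.stream().anyMatch(lower::contains);
            safe.put(names[i], sensitive ? "[REDACTED]" : String.valueOf(values[i]));
        }
        return safe;
    }

    public static void main(String[] args) {
        System.out.println(redact(
                new String[] {"userId", "newPassword"},
                new Object[] {"789", "hunter2"}));
    }
}
```

Inside an aspect, the names could come from AspectJ's MethodSignature.getParameterNames(), though that may return null depending on how the code was compiled, so treat this as a starting point rather than a complete safeguard.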
With all this data flowing—logs to an aggregator like Elasticsearch, metrics to Prometheus, traces to Jaeger—you need a way to look at it. This is where a dashboard tool like Grafana comes in. It can query data from all these different sources and display them on a single screen.
You don’t build one giant dashboard. You build several focused ones. An “On-Call” dashboard for the engineer responding to an alert might show top-level service health, error rates, and latency. A “Business” dashboard might show orders per minute or user sign-ups. The power is in linking them.
For example, in Grafana, you can create a graph from Prometheus data showing latency. If you’ve configured exemplars, you might see small diamonds on the graph at high-latency points. Clicking a diamond can take you directly to the Jaeger trace for that specific slow request, which in turn contains the trace ID to search for logs. The loop is closed.
Finally, all this observability is for nothing if no one knows when things go wrong. You need alerts. But the worst thing you can do is alert on every little thing. You’ll get alert fatigue and start ignoring them. The key is to alert on symptoms that matter to users, not on internal technical events.
A good alert answers: “Is a user seeing a problem right now?”
Here’s an example of a Prometheus alerting rule that focuses on a user-facing symptom: a high error rate for an important API.
# prometheus-alerts.yml
groups:
  - name: api_errors
    rules:
      - alert: HighApiErrorRate
        # This expression calculates the ratio of 5xx errors to total requests over the last 2 minutes
        # and alerts if more than 2% of requests are errors.
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5..", uri="/api/checkout"}[2m]))
          /
          sum(rate(http_server_requests_seconds_count{uri="/api/checkout"}[2m]))
          > 0.02
        for: 1m # Wait for this condition to be true for 1 minute before firing (prevents brief spikes)
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "High error rate on checkout API"
          # $value is a ratio (0.05 means 5%), so format it as a percentage
          description: "The error rate for /api/checkout is {{ $value | humanizePercentage }}. This is impacting users."
          runbook: "https://wiki.example.com/runbooks/checkout-errors"
This alert doesn’t care if the CPU is high or memory is low. It cares if users are getting errors. When this alert fires, someone needs to look at it. They have the dashboard, the traces, and the logs—all connected—to find out why.
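The two parts of that rule, the ratio threshold and the for clause, can be sketched as a tiny state machine. This is a simplified model of how I understand the behavior, not Prometheus code: AlertRuleSketch is a hypothetical class, time is passed in as seconds, and the condition must hold at every evaluation for the full hold duration before the alert moves from pending to firing.

```java
// Simplified model of a threshold alert with a "for" hold duration.
final class AlertRuleSketch {
    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold; // e.g. 0.02 for 2%
    private final long forSeconds;  // how long the condition must hold before firing
    private long activeSince = -1;  // time the condition first became true, or -1

    AlertRuleSketch(double threshold, long forSeconds) {
        this.threshold = threshold;
        this.forSeconds = forSeconds;
    }

    // Evaluates the rule at the given time from the current error and total request rates.
    State evaluate(long nowSeconds, double errorRate, double totalRate) {
        // With no traffic there is nothing to alert on.
        double ratio = totalRate > 0 ? errorRate / totalRate : 0.0;
        if (ratio <= threshold) {
            activeSince = -1; // the condition broke, so the hold timer resets
            return State.INACTIVE;
        }
        if (activeSince < 0) {
            activeSince = nowSeconds;
        }
        return (nowSeconds - activeSince >= forSeconds) ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        AlertRuleSketch rule = new AlertRuleSketch(0.02, 60);
        System.out.println(rule.evaluate(0, 5, 100));   // PENDING
        System.out.println(rule.evaluate(30, 5, 100));  // PENDING
        System.out.println(rule.evaluate(60, 5, 100));  // FIRING
        System.out.println(rule.evaluate(90, 1, 100));  // INACTIVE
    }
}
```

A single good evaluation resets the timer, which is exactly what keeps a brief spike from paging anyone.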
Putting it all together, observability is not a single tool or a checkbox. It’s a practice. It starts with instrumenting your code to emit structured logs, metrics, and traces. It grows by connecting that data together with common identifiers. It matures by building thoughtful dashboards and setting precise, actionable alerts.
The goal is to move from asking “Is it working?” to asking “How is it working?” and finally to being able to answer “Why is it working this way?” When you can do that, you stop fearing production. You start understanding it. And that is the most powerful place a developer can be.