Monitoring and metrics collection are critical aspects of managing Java applications in production environments. Effective monitoring helps identify performance bottlenecks, catch failures early, and maintain a good user experience. I’ve implemented several monitoring solutions throughout my career and found that combining complementary techniques works better than relying on any single tool.
JMX for Runtime Metrics Collection
Java Management Extensions (JMX) provides a standard way to monitor and manage applications. It exposes JVM metrics out of the box and lets applications publish their own metrics as MBeans that management tools can query at runtime.
@MXBean
public interface ApplicationMetrics {
long getActiveUsers();
long getTotalRequests();
}
public class ApplicationMetricsImpl implements ApplicationMetrics {
private final AtomicLong activeUsers = new AtomicLong(0);
private final AtomicLong totalRequests = new AtomicLong(0);
public ApplicationMetricsImpl() {
MBeanServer server = ManagementFactory.getPlatformMBeanServer();
try {
ObjectName name = new ObjectName("com.example:type=ApplicationMetrics");
server.registerMBean(this, name);
} catch (Exception e) {
throw new RuntimeException("Failed to register MBean", e);
}
}
@Override
public long getActiveUsers() {
return activeUsers.get();
}
@Override
public long getTotalRequests() {
return totalRequests.get();
}
public void incrementUsers() {
activeUsers.incrementAndGet();
}
public void decrementUsers() {
activeUsers.decrementAndGet();
}
public void recordRequest() {
totalRequests.incrementAndGet();
}
}
JMX metrics can be accessed through tools like JConsole or VisualVM, which connect directly to the JVM process and visualize the data in real time. I’ve found them particularly useful for quick diagnostics when no additional monitoring infrastructure has been deployed.
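When a GUI tool isn’t practical, the same attributes can also be read programmatically over a remote JMX connection. Below is a minimal sketch, assuming the target JVM was started with the standard com.sun.management.jmxremote properties; the host, port, and reader class name are placeholders:
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
public class JmxMetricsReader {
    public static void main(String[] args) throws Exception {
        // Standard RMI-based JMX URL; host and port are placeholders for the target JVM
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ObjectName name = new ObjectName("com.example:type=ApplicationMetrics");
            // Attribute names are derived from the getters on the ApplicationMetrics interface
            Long activeUsers = (Long) connection.getAttribute(name, "ActiveUsers");
            Long totalRequests = (Long) connection.getAttribute(name, "TotalRequests");
            System.out.printf("activeUsers=%d, totalRequests=%d%n", activeUsers, totalRequests);
        } finally {
            connector.close();
        }
    }
}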
Micrometer Integration for Metrics Registry
Micrometer provides a vendor-neutral metrics facade, so applications can collect metrics without being tied to a specific monitoring backend.
public class MetricsService {
private final MeterRegistry registry;
public MetricsService(MeterRegistry registry) {
this.registry = registry;
}
public void recordRequestLatency(String endpoint, long latencyMs) {
Timer timer = registry.timer("http.request.latency", "endpoint", endpoint);
timer.record(latencyMs, TimeUnit.MILLISECONDS);
}
public void incrementCounter(String name, String... tags) {
registry.counter(name, tags).increment();
}
public void registerGauge(String name, Supplier<Number> valueSupplier, String... tags) {
// Gauges are sampled on demand, so register once with a supplier;
// capturing a fixed value in a lambda would freeze the reading at registration time
Gauge.builder(name, valueSupplier)
.tags(tags)
.register(registry);
}
public void recordHistogram(String name, double value, String... tags) {
DistributionSummary summary = registry.summary(name, tags);
summary.record(value);
}
}
I’ve integrated Micrometer with Spring Boot applications to great effect. The ability to switch between different monitoring backends (Prometheus, Datadog, etc.) without changing application code is invaluable when moving between different cloud providers or environments.
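As a concrete sketch of that swap, the same MetricsService can be handed an in-memory SimpleMeterRegistry for tests or a PrometheusMeterRegistry for production, with only the wiring changing. The port, the /metrics path, and the use of the JDK’s built-in HttpServer below are illustrative assumptions:
import com.sun.net.httpserver.HttpServer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
public class MetricsWiring {
    // In-memory registry: metrics are collected but not exported, handy for unit tests
    public static MetricsService forTests() {
        return new MetricsService(new SimpleMeterRegistry());
    }
    // Prometheus registry plus a scrape endpoint; swapping backends never touches MetricsService
    public static MetricsService forPrometheus() throws IOException {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        HttpServer server = HttpServer.create(new InetSocketAddress(9091), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return new MetricsService(registry);
    }
}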
Health Check Implementation
Health checks provide a simple way to verify if application components are functioning correctly. They’re essential for container orchestration systems like Kubernetes.
public enum HealthStatus {
UP, DOWN, DEGRADED
}
public interface HealthIndicator {
String getName();
HealthStatus check();
}
public class DatabaseHealthIndicator implements HealthIndicator {
private final DataSource dataSource;
public DatabaseHealthIndicator(DataSource dataSource) {
this.dataSource = dataSource;
}
@Override
public String getName() {
return "database";
}
@Override
public HealthStatus check() {
try (Connection conn = dataSource.getConnection();
PreparedStatement stmt = conn.prepareStatement("SELECT 1")) {
stmt.execute();
return HealthStatus.UP;
} catch (SQLException e) {
return HealthStatus.DOWN;
}
}
}
public class HealthCheckService {
private final List<HealthIndicator> indicators = new ArrayList<>();
public void registerIndicator(HealthIndicator indicator) {
indicators.add(indicator);
}
public Map<String, HealthStatus> checkHealth() {
Map<String, HealthStatus> statuses = new HashMap<>();
for (HealthIndicator indicator : indicators) {
statuses.put(indicator.getName(), indicator.check());
}
return statuses;
}
public HealthStatus getOverallStatus() {
// Run the indicators once and derive the aggregate from that single snapshot
Map<String, HealthStatus> results = checkHealth();
if (results.containsValue(HealthStatus.DOWN)) {
return HealthStatus.DOWN;
}
if (results.containsValue(HealthStatus.DEGRADED)) {
return HealthStatus.DEGRADED;
}
return HealthStatus.UP;
}
}
Health checks should be lightweight and fast. In my experience, they should complete within milliseconds to avoid impacting system performance.
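One way to enforce that is to run each indicator under a hard time limit and treat a timeout as DOWN. The sketch below does this with a small executor; the pool size and the per-check budget are illustrative assumptions rather than recommendations:
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
public class TimeboxedHealthCheckService {
    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final List<HealthIndicator> indicators;
    private final long timeoutMillis;
    public TimeboxedHealthCheckService(List<HealthIndicator> indicators, long timeoutMillis) {
        this.indicators = indicators;
        this.timeoutMillis = timeoutMillis;
    }
    public Map<String, HealthStatus> checkHealth() {
        Map<String, HealthStatus> statuses = new ConcurrentHashMap<>();
        for (HealthIndicator indicator : indicators) {
            Future<HealthStatus> future = executor.submit(indicator::check);
            try {
                // A slow dependency must not stall the whole health endpoint
                statuses.put(indicator.getName(), future.get(timeoutMillis, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                future.cancel(true);
                statuses.put(indicator.getName(), HealthStatus.DOWN);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                statuses.put(indicator.getName(), HealthStatus.DOWN);
            } catch (Exception e) {
                statuses.put(indicator.getName(), HealthStatus.DOWN);
            }
        }
        return statuses;
    }
}
Checks that regularly exhaust their budget usually point at a struggling dependency rather than a budget that needs to grow.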
Distributed Tracing with OpenTelemetry
Distributed tracing is critical for understanding request flows in microservice architectures. OpenTelemetry provides a standardized way to collect and export trace data.
public class TracingService {
private final Tracer tracer;
public TracingService(Tracer tracer) {
this.tracer = tracer;
}
public <T> T traceOperation(String operationName, Supplier<T> operation) {
Span span = tracer.spanBuilder(operationName).startSpan();
try (Scope scope = span.makeCurrent()) {
return operation.get();
} catch (Exception e) {
span.recordException(e);
span.setStatus(StatusCode.ERROR);
throw e;
} finally {
span.end();
}
}
public void addSpanAttribute(String key, String value) {
Span current = Span.current();
if (current.isRecording()) {
current.setAttribute(key, value);
}
}
public Span createChildSpan(String name) {
return tracer.spanBuilder(name)
.setParent(Context.current().with(Span.current()))
.startSpan();
}
}
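A typical call site wraps a unit of work with traceOperation and tags the active span from inside the supplier; OrderService and its lookup method below are hypothetical placeholders:
public class OrderService {
    private final TracingService tracingService;
    public OrderService(TracingService tracingService) {
        this.tracingService = tracingService;
    }
    // The whole lookup runs inside one span named "order.load"
    public String loadOrder(String orderId) {
        return tracingService.traceOperation("order.load", () -> {
            tracingService.addSpanAttribute("order.id", orderId);
            return fetchFromDatabase(orderId);
        });
    }
    private String fetchFromDatabase(String orderId) {
        return "order-" + orderId; // hypothetical stand-in for a repository call
    }
}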
Setting up OpenTelemetry in a production environment:
public class OpenTelemetryConfig {
public static SdkTracerProvider initTracing() {
Resource resource = Resource.getDefault()
.merge(Resource.create(Attributes.of(
ResourceAttributes.SERVICE_NAME, "my-service",
ResourceAttributes.SERVICE_VERSION, "1.0.0"
)));
SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
.setResource(resource)
.addSpanProcessor(BatchSpanProcessor.builder(
OtlpGrpcSpanExporter.builder()
.setEndpoint("http://otel-collector:4317")
.build())
.build())
.build();
// buildAndRegisterGlobal() installs the SDK as the GlobalOpenTelemetry instance
OpenTelemetrySdk.builder()
.setTracerProvider(sdkTracerProvider)
.buildAndRegisterGlobal();
return sdkTracerProvider;
}
}
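Once the configuration above has run, the application still needs a Tracer to construct the TracingService. A minimal wiring sketch, where the instrumentation scope name "com.example.my-service" is an arbitrary choice:
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
public class TracingBootstrap {
    public static TracingService createTracingService() {
        // initTracing() builds the SDK, registers it globally, and returns the provider that hands out tracers
        SdkTracerProvider provider = OpenTelemetryConfig.initTracing();
        Tracer tracer = provider.get("com.example.my-service");
        return new TracingService(tracer);
    }
}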
From my experience, effective distributed tracing has helped reduce debugging time by up to 80% in complex microservice architectures.
Prometheus Integration
Prometheus has become a standard for metrics collection in cloud-native applications. Its pull-based model and powerful query language make it suitable for various monitoring scenarios.
public class PrometheusConfig {
// Metrics are registered once as fields so they can be updated from request-handling code
private final Counter requestsTotal = Counter.build()
.name("app_requests_total")
.help("Total number of requests")
.labelNames("method", "endpoint", "status")
.register();
private final Gauge activeConnections = Gauge.build()
.name("app_active_connections")
.help("Current number of active connections")
.register();
private final Histogram responseTime = Histogram.build()
.name("app_response_time_seconds")
.help("Response time in seconds")
.buckets(0.1, 0.3, 0.5, 0.7, 1, 3, 5, 10)
.register();
public HTTPServer configureMetricsEndpoint() throws IOException {
// Exposes everything registered in the default registry, including the metrics above
return new HTTPServer.Builder()
.withPort(8080)
.withRegistry(CollectorRegistry.defaultRegistry)
.build();
}
public void recordRequest(String method, String endpoint, int status) {
requestsTotal.labels(method, endpoint, String.valueOf(status)).inc();
}
public void recordResponseTime(double durationSeconds) {
responseTime.observe(durationSeconds);
}
public void setActiveConnections(int count) {
activeConnections.set(count);
}
}
I’ve found that defining appropriate buckets for histogram metrics is crucial. They should reflect the expected distribution of values and highlight outliers effectively.
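For example, for a service with a hypothetical 300 ms latency objective it pays to concentrate buckets around that boundary, so the fraction of requests inside and outside the objective can be read straight off the histogram; the metric name and bucket values below are illustrative assumptions:
import io.prometheus.client.Histogram;
public class LatencyMetrics {
    // Buckets cluster around the assumed 300 ms objective instead of being spread evenly
    static final Histogram checkoutLatency = Histogram.build()
            .name("checkout_latency_seconds")
            .help("Checkout latency in seconds")
            .buckets(0.05, 0.1, 0.2, 0.25, 0.3, 0.35, 0.5, 1, 2.5)
            .register();
    public static void observe(double seconds) {
        checkoutLatency.observe(seconds);
    }
}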
Log Aggregation
Centralized logging is essential for troubleshooting issues in distributed systems. Configuring proper log formatting and aggregation tools ensures quick access to relevant information.
public class LoggingConfig {
public static void configureLogback() {
LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
PatternLayoutEncoder encoder = new PatternLayoutEncoder();
encoder.setPattern("%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] [%X{traceId},%X{spanId}] %-5level %logger{36} - %msg%n");
encoder.setContext(context);
encoder.start();
FileAppender<ILoggingEvent> fileAppender = new FileAppender<>();
fileAppender.setFile("application.log");
fileAppender.setEncoder(encoder);
fileAppender.setContext(context);
fileAppender.start();
ConsoleAppender<ILoggingEvent> consoleAppender = new ConsoleAppender<>();
consoleAppender.setEncoder(encoder);
consoleAppender.setContext(context);
consoleAppender.start();
Logger rootLogger = (Logger) LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
rootLogger.detachAndStopAllAppenders();
rootLogger.addAppender(fileAppender);
rootLogger.addAppender(consoleAppender);
// Configure JSON appender for log aggregation tools
LogstashEncoder logstashEncoder = new LogstashEncoder();
logstashEncoder.setContext(context);
logstashEncoder.start();
FileAppender<ILoggingEvent> jsonAppender = new FileAppender<>();
jsonAppender.setFile("application-json.log");
jsonAppender.setEncoder(logstashEncoder);
jsonAppender.setContext(context);
jsonAppender.start();
rootLogger.addAppender(jsonAppender);
}
}
Including trace and span IDs in logs makes it possible to correlate log entries with distributed traces, providing a complete picture of request processing.
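The %X{traceId} and %X{spanId} placeholders in the pattern above are read from SLF4J's MDC, so something has to copy the identifiers there. The sketch below does it by hand around a unit of work; instrumentation agents can populate the MDC automatically as well, though their default key names may differ:
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import org.slf4j.MDC;
public class TraceLoggingContext {
    // Copies the current span's identifiers into the MDC so the log pattern can print them
    public static void runWithTraceContext(Runnable task) {
        SpanContext spanContext = Span.current().getSpanContext();
        if (spanContext.isValid()) {
            MDC.put("traceId", spanContext.getTraceId());
            MDC.put("spanId", spanContext.getSpanId());
        }
        try {
            task.run();
        } finally {
            MDC.remove("traceId");
            MDC.remove("spanId");
        }
    }
}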
Performance Profiling
Profiling helps identify performance bottlenecks by measuring execution times and resource usage of specific code sections.
public class ApplicationProfiler {
private static final Logger logger = LoggerFactory.getLogger(ApplicationProfiler.class);
private static final Map<String, Long> startTimes = new ConcurrentHashMap<>();
private static final Map<String, DescriptiveStatistics> statistics = new ConcurrentHashMap<>();
public static void startTimer(String operationId) {
startTimes.put(operationId, System.nanoTime());
}
public static long stopTimer(String operationId) {
Long startTime = startTimes.remove(operationId);
if (startTime == null) {
logger.warn("No start time found for operation {}", operationId);
return -1;
}
long duration = System.nanoTime() - startTime;
double durationMs = duration / 1_000_000.0;
// SynchronizedDescriptiveStatistics keeps concurrent recordings thread-safe
statistics.computeIfAbsent(operationId, k -> new SynchronizedDescriptiveStatistics())
.addValue(durationMs);
// SLF4J only supports {} placeholders, so format the duration explicitly
logger.info("Operation {} took {} ms", operationId, String.format("%.2f", durationMs));
return duration;
}
public static Map<String, Map<String, Double>> getStatistics() {
Map<String, Map<String, Double>> result = new HashMap<>();
for (Map.Entry<String, DescriptiveStatistics> entry : statistics.entrySet()) {
DescriptiveStatistics stats = entry.getValue();
Map<String, Double> operationStats = new HashMap<>();
operationStats.put("min", stats.getMin());
operationStats.put("max", stats.getMax());
operationStats.put("mean", stats.getMean());
operationStats.put("p50", stats.getPercentile(50));
operationStats.put("p95", stats.getPercentile(95));
operationStats.put("p99", stats.getPercentile(99));
result.put(entry.getKey(), operationStats);
}
return result;
}
public static void resetStatistics() {
statistics.clear();
}
}
Using aspect-oriented programming for automatic profiling:
@Aspect
@Component
public class PerformanceMonitoringAspect {
@Around("@annotation(Profiled)")
public Object profileMethod(ProceedingJoinPoint joinPoint) throws Throwable {
String methodName = joinPoint.getSignature().toShortString();
ApplicationProfiler.startTimer(methodName);
try {
return joinPoint.proceed();
} finally {
ApplicationProfiler.stopTimer(methodName);
}
}
}
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface Profiled {
}
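Applying it is then just a matter of annotating the methods of interest; the service below is a hypothetical example:
import org.springframework.stereotype.Service;
@Service
public class ReportService {
    // Timed automatically by PerformanceMonitoringAspect; the report logic is a placeholder
    @Profiled
    public String generateMonthlyReport(String accountId) {
        return "report-for-" + accountId;
    }
}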
I’ve often found that performance issues are concentrated in a small percentage of code. Targeted profiling helps identify these areas without adding significant overhead.
Resource Monitoring
Monitoring system resources (CPU, memory, disk, network) helps detect issues before they impact users. Automated alerts can notify teams when resources approach critical levels.
public class ResourceMonitor {
private static final Logger logger = LoggerFactory.getLogger(ResourceMonitor.class);
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
private final double memoryThresholdPercent;
private final double cpuThreshold;
private final AlertService alertService;
public ResourceMonitor(double memoryThresholdPercent, double cpuThreshold, AlertService alertService) {
this.memoryThresholdPercent = memoryThresholdPercent;
this.cpuThreshold = cpuThreshold;
this.alertService = alertService;
}
public void startMonitoring(int intervalSeconds) {
scheduler.scheduleAtFixedRate(this::checkResources, 0, intervalSeconds, TimeUnit.SECONDS);
}
private void checkResources() {
try {
OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
Runtime runtime = Runtime.getRuntime();
// Memory metrics
long maxMemory = runtime.maxMemory();
long totalMemory = runtime.totalMemory();
long freeMemory = runtime.freeMemory();
long usedMemory = totalMemory - freeMemory;
double memoryUsagePercent = (double) usedMemory / maxMemory * 100;
// CPU metrics
double cpuLoad = osBean.getSystemLoadAverage();
// Disk metrics
File root = new File("/");
long totalSpace = root.getTotalSpace();
long freeSpace = root.getFreeSpace();
double diskUsagePercent = (double) (totalSpace - freeSpace) / totalSpace * 100;
logger.info("Memory used: {}MB ({}%), CPU load: {}, Disk usage: {}%",
usedMemory / (1024 * 1024),
String.format("%.2f", memoryUsagePercent),
String.format("%.2f", cpuLoad),
String.format("%.2f", diskUsagePercent));
// Alert on thresholds
if (memoryUsagePercent > memoryThresholdPercent) {
alertService.sendAlert("Memory usage high",
"Memory usage at " + String.format("%.2f", memoryUsagePercent) + "%");
}
// getSystemLoadAverage() returns a negative value when the platform cannot provide it
if (cpuLoad >= 0 && cpuLoad > cpuThreshold) {
alertService.sendAlert("CPU load high",
"CPU load at " + String.format("%.2f", cpuLoad));
}
if (diskUsagePercent > 90) {
alertService.sendAlert("Disk usage high",
"Disk usage at " + String.format("%.2f", diskUsagePercent) + "%");
}
} catch (Exception e) {
logger.error("Error monitoring resources", e);
}
}
public void shutdown() {
scheduler.shutdown();
try {
if (!scheduler.awaitTermination(5, TimeUnit.SECONDS)) {
scheduler.shutdownNow();
}
} catch (InterruptedException e) {
scheduler.shutdownNow();
Thread.currentThread().interrupt();
}
}
}
I’ve found that tracking resource trends over time is often more valuable than point-in-time measurements. Gradual increases in resource usage can indicate memory leaks or other issues that might not immediately trigger alerts.
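A lightweight way to capture such a trend is to keep a rolling window of samples and compare the recent average against the older one. The sketch below does exactly that; the window size and growth threshold are arbitrary assumptions:
import java.util.ArrayDeque;
import java.util.Deque;
public class TrendTracker {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int maxSamples;
    public TrendTracker(int maxSamples) {
        this.maxSamples = maxSamples;
    }
    public synchronized void addSample(double value) {
        window.addLast(value);
        if (window.size() > maxSamples) {
            window.removeFirst();
        }
    }
    // Flags a sustained upward drift by comparing the newer half of the window with the older half
    public synchronized boolean isTrendingUp(double growthRatioThreshold) {
        if (window.size() < maxSamples) {
            return false; // not enough data yet
        }
        int half = window.size() / 2;
        double olderSum = 0;
        double newerSum = 0;
        int index = 0;
        for (double sample : window) {
            if (index++ < half) {
                olderSum += sample;
            } else {
                newerSum += sample;
            }
        }
        double olderAvg = olderSum / half;
        double newerAvg = newerSum / (window.size() - half);
        return olderAvg > 0 && newerAvg / olderAvg > growthRatioThreshold;
    }
}
Feeding it the memoryUsagePercent values from checkResources and alerting when, say, isTrendingUp(1.1) returns true can surface a slow leak long before a fixed threshold fires.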
Monitoring Java applications requires a multi-faceted approach. JMX provides native integration, Micrometer offers flexibility, health checks ensure availability, distributed tracing helps with complex architectures, and resource monitoring prevents infrastructure-related failures.
Having implemented these techniques across various projects, I can confirm that a comprehensive monitoring strategy significantly improves application reliability and performance. The key is to select the right tools for your specific requirements and ensure they work together cohesively.