
How to Debug Java Production Issues Without Taking the System Down

Learn how to debug live Java applications using heap dumps, thread analysis, GC logs, and distributed tracing. Diagnose production issues without downtime.


Debugging a problem in a production Java application feels different. You can’t just stop the world, add a breakpoint, and step through the code. The system is live, users are depending on it, and every action carries more weight. Over time, I’ve learned to rely on a set of systematic approaches to find the root cause without making the situation worse. Let’s walk through some of the most effective methods.

The first step is always observation. You notice a symptom—maybe the application is getting slower, using more memory, or throwing errors. Your goal is to move from that symptom to a clear understanding of why it’s happening.

When memory usage climbs steadily until the application crashes with an OutOfMemoryError, you are likely dealing with a memory leak. Objects are being created but not released for garbage collection because something still holds a reference to them. To see this, you need a snapshot of the memory, called a heap dump.

You can get a heap dump in a couple of ways. If the application is still running, you can use the jmap tool from the command line. You’ll need the process ID of your Java application.

jmap -dump:live,format=b,file=heap.hprof 12345

This command tells the JVM to take a live heap dump from process 12345 and save it to a file named heap.hprof. The live option triggers a full garbage collection first, so you only see objects that are still in use. Sometimes, you want the dump to happen automatically when the error occurs. You can start your application with special JVM flags.

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/myapp -jar application.jar

Now, when the OutOfMemoryError strikes, a heap dump file will be created in /var/log/myapp. You have the evidence. The next step is analysis. A raw heap dump file isn’t human-readable. You need a tool like Eclipse Memory Analyzer (MAT) or VisualVM.

When you load the file into MAT, you are presented with a lot of information. I often start with the “Leak Suspects Report.” It gives a high-level summary of the largest objects in memory and what is keeping them alive. The key concept here is the “dominator tree.” An object dominates another if all paths to that other object go through the first one. By looking at the dominator tree, you can quickly find which single object is responsible for holding a large chunk of memory.

For example, you might find that a single HashMap instance, used as a static cache, is dominating 80% of the heap. The report shows the class of the objects it holds—perhaps millions of String keys. This points you directly to the problem: a cache that never expires entries. The fix isn’t in the heap dump, but the dump shows you exactly where to look in your code.
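To make that failure mode concrete, here is a minimal, hypothetical sketch of the pattern MAT often points to: a static map used as a cache, with entries added but never evicted. All names here are illustrative, not taken from any real application.

```java
import java.util.HashMap;
import java.util.Map;

public class SessionCache {
    // A static map lives as long as the class itself: every entry put
    // here stays reachable forever unless something explicitly removes it.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static void put(String sessionId, byte[] data) {
        CACHE.put(sessionId, data); // entries go in...
        // ...but there is no eviction and no expiry: the map only grows
    }

    public static int size() {
        return CACHE.size();
    }
}
```

A common fix is to bound the cache, for example with a `LinkedHashMap` that overrides `removeEldestEntry`, or a caching library that supports size limits and expiry.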

A sudden, sustained spike in CPU usage is another common alarm. The application is working very hard at something. To find out what, you need to see what the threads are doing. A thread dump shows the exact line of code each thread is executing at a single moment in time.

You can capture a thread dump using jstack.

jstack -l 12345 > thread_dump.txt

The -l option includes additional information about locks. Looking at one thread dump is helpful, but it’s a static picture. If a thread is in a fast loop, it might be on a different line of code each time you sample. To see the hot code path, you need to take several dumps, two or three seconds apart, and then compare them.
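A small shell loop makes the repeated sampling easy; this sketch assumes the same example process ID used above.

```shell
PID=12345   # replace with your application's process ID
for i in 1 2 3; do
    jstack -l "$PID" > "thread_dump_$i.txt" || break
    sleep 3
done
```

Comparing the three files side by side shows which threads stay busy in the same place.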

Open the thread_dump.txt file. You’ll see a list of all threads. Look for threads whose state is RUNNABLE. This means they are actively using CPU cycles. Then, examine their stack traces—the list of method calls that led to the current line of code.

If you take three dumps and the same thread is always RUNNABLE and always executing the same method, like com.example.Processor.calculate(), you have a strong lead. That method is likely in a tight loop or performing a very expensive calculation. For an even clearer picture, a sampling profiler like async-profiler is invaluable. It periodically samples where all your threads are executing and builds a statistical profile of where the CPU time is spent, all with very low overhead.

Sometimes, the application doesn’t use CPU; it just stops. Requests hang and eventually time out. This often points to thread contention or a deadlock. A deadlock is a classic problem where two threads are each waiting for a lock the other holds, so neither can proceed.

The good news is the JVM can detect these classic deadlocks. When you run jstack, it will usually print a section at the end if it finds any.

Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x00007f88a4008f00 (object 0x000000076ac44778, a com.example.ResourceA),
  which is held by "Thread-2"
"Thread-2":
  waiting to lock monitor 0x00007f88a4009020 (object 0x000000076ac44790, a com.example.ResourceB),
  which is held by "Thread-1"

The output clearly shows the circular wait. The fix involves changing the order in which locks are acquired or using timeouts with tryLock. Not all stalls are full deadlocks. Sometimes, many threads are just waiting for a single, heavily contended lock. This is high contention. In your thread dump, you’ll see many threads in BLOCKED state, all trying to enter a synchronized block on the same object. The solution might be to reduce the scope of the lock, use a concurrent collection, or employ a read-write lock.
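As a sketch of the tryLock approach to breaking the circular wait, assuming two hypothetical resources each guarded by a ReentrantLock:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TransferService {
    private final ReentrantLock lockA = new ReentrantLock();
    private final ReentrantLock lockB = new ReentrantLock();

    public boolean transfer() {
        try {
            // Try the first lock, then the second, each with a timeout:
            // if the second can't be taken, back off instead of waiting forever.
            if (lockA.tryLock(1, TimeUnit.SECONDS)) {
                try {
                    if (lockB.tryLock(1, TimeUnit.SECONDS)) {
                        try {
                            return true; // both locks held: do the work here
                        } finally {
                            lockB.unlock();
                        }
                    }
                } finally {
                    lockA.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false; // couldn't get both locks; the caller can retry
    }
}
```

Consistent lock ordering — always acquiring lockA before lockB everywhere — removes the circular wait entirely; timeouts are the fallback when you can't guarantee ordering.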

The garbage collector is your silent partner, cleaning up unused memory. When it struggles, your application struggles. Long or frequent garbage collection pauses, called “stop-the-world” events, freeze your application threads. To understand GC behavior, you must enable logging.

Modern JDKs use a unified logging system. Here’s how you can turn on detailed GC logging.

java -Xlog:gc*,gc+heap*,gc+age*:file=gc.log:time:filecount=5,filesize=10m -jar app.jar

This logs all GC events and heap details to a rotating file. After a period of poor performance, you examine gc.log. What are you looking for? Patterns. You might see many “Full GC” events happening in quick succession. A Full GC cleans the entire heap and is typically slow. Frequent Full GCs often mean the “Old Generation” part of the heap is filling up too fast. This could be due to a memory leak, or it could mean your heap is simply too small for the application’s workload.

Conversely, you might see that “Young GC” pauses, which are usually short, are taking several seconds. This suggests a problem with the creation rate of short-lived objects. Perhaps a new feature is allocating huge, temporary arrays in a loop. The GC logs give you the timing and cause; your job is to connect that to a recent code change or increased load.
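Alongside the log file, the JVM exposes collector statistics at runtime through the standard management API. A small sketch for spotting rising GC time between two checks:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Returns total milliseconds spent in GC so far, across all collectors.
    // Sampling this periodically and diffing reveals a rising GC burden.
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }
}
```

This is a coarse signal compared to full GC logs, but it is cheap enough to export as a metric and alert on.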

In today’s world of microservices, a single user request can travel through five or six different applications. If a request is slow, which service is the culprit? This is where distributed tracing shines. Tools like Jaeger or Zipkin work by assigning a unique trace ID to each incoming request at the very edge of your system.

This trace ID is then passed along with every subsequent internal call, whether it’s an HTTP request to another service or a database query. Each service adds its own “span” to the trace, recording the start and end time of its work.

Implementing it involves a small bit of code. Using a library like OpenTelemetry, you might instrument a web controller.

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {

    private final Tracer tracer;
    private final OrderService orderService;

    public OrderController(Tracer tracer, OrderService orderService) {
        this.tracer = tracer;
        this.orderService = orderService;
    }

    @GetMapping("/order/{id}")
    public Order getOrder(@PathVariable String id) {
        Span span = tracer.spanBuilder("getOrder").startSpan();
        try (Scope scope = span.makeCurrent()) {
            // The trace context is now automatically propagated in this scope
            Order order = orderService.findOrder(id);
            span.setAttribute("order.id", id);
            return order;
        } catch (Exception e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

Later, in a dashboard, you can see a visual waterfall diagram of the entire request. You might discover that the 2-second delay is not in your Java code at all, but in a call to a third-party payment service, or in a specific, unoptimized database query. Tracing turns a vague “the system is slow” into “this specific call to Service X is slow.”

Your application is almost certainly writing logs. In production, the volume can be overwhelming. The key is structure. Writing logs as plain lines of text makes them hard to filter and analyze at scale. Instead, log in a structured format like JSON.

Many logging frameworks can do this. Here’s an example with SLF4J and Logback, using the Mapped Diagnostic Context (MDC) to add fields.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class PaymentService {
    private static final Logger logger = LoggerFactory.getLogger(PaymentService.class);

    public void processPayment(String userId, String requestId) {
        // Put contextual information into the MDC
        MDC.put("userId", userId);
        MDC.put("requestId", requestId);
        try {
            logger.info("Starting payment processing");
            boolean paymentFailed = false;
            // ... business logic sets paymentFailed on error ...

            if (paymentFailed) {
                logger.error("Payment failed for user");
            }
        } finally {
            // Always clear the MDC for this thread, even if an exception escapes
            MDC.clear();
        }
    }
}

With a layout configured for JSON, a log entry looks like this:

{
  "timestamp": "2023-10-27T10:15:30.123Z",
  "level": "ERROR",
  "logger": "PaymentService",
  "message": "Payment failed for user",
  "userId": "user-456",
  "requestId": "req-789"
}
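One way to produce output like that — sketched here assuming the third-party logstash-logback-encoder library is on the classpath — is a Logback configuration along these lines:

```xml
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <!-- LogstashEncoder emits each event as one JSON object,
         including MDC entries as top-level fields -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
```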

Now, in your log aggregation system, you can run powerful queries. You can find all errors for a specific userId. You can group errors by the logger field to see which class is throwing the most exceptions. You can track a single requestId across all the microservices it touched. This transforms logs from a text file you grep into a queryable dataset for investigation.

Java Flight Recorder (JFR) is a remarkable tool that comes bundled with the JDK. It is designed to have such a low performance overhead that you can run it continuously in production. JFR records events happening inside the JVM: method calls, garbage collections, file I/O, socket reads, thread parks, and much more.

Starting a recording is simple. You can do it from the command line when launching the app.

java -XX:StartFlightRecording=filename=myrecording.jfr,duration=1h -jar app.jar

This starts a one-hour recording. You can also trigger it on demand from within your code or using tools like jcmd.

jcmd 12345 JFR.start name=MyRecording duration=60s filename=/tmp/trouble.jfr

After an incident, you stop the recording and download the .jfr file. You open it with JDK Mission Control (JMC). The interface lets you explore all the recorded events on a timeline. You can see a spike in “Java Blocking” events that corresponds to the time users reported slowness. Drilling down, you find that the blocking was caused by a particular ReentrantLock. You can see the stack trace of the threads holding and waiting for that lock. JFR provides a correlated, time-synchronized view of many different aspects of your application’s behavior, which is incredibly powerful for diagnosing complex, intermittent issues.
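JFR can also record your own application events through the jdk.jfr API, so business operations appear on the same timeline as the JVM's built-in events. A minimal sketch, with illustrative names:

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

public class JfrExample {

    // A custom event: when a recording is active, instances show up
    // in JMC alongside GC, lock, and I/O events.
    @Name("com.example.PaymentProcessed")
    @Label("Payment Processed")
    static class PaymentEvent extends Event {
        @Label("Order ID")
        String orderId;
    }

    static PaymentEvent record(String orderId) {
        PaymentEvent event = new PaymentEvent();
        event.orderId = orderId;
        event.begin();          // start the event's clock
        // ... the work you want timed ...
        event.commit();         // recorded only if a recording is active
        return event;
    }
}
```

Because commit() is effectively free when no recording is running, this instrumentation can stay in production code permanently.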

Not all problems originate within your JVM. The application communicates with the outside world—databases, other services, APIs. Network issues can cause timeouts, slow responses, and failures.

Basic command-line tools are your first line of defense. From the host or container running your app, you can test connectivity.

# Can I reach the host?
ping database-host

# Is the specific port open?
nc -zv database-host 5432

# What route do packets take?
traceroute database-host

Within your Java application, you can add more insight. Clients like Apache HttpClient and OkHttp can be configured to log connection and request timings; even with the JDK’s built-in HttpClient, you can set explicit timeouts and measure request durations yourself.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(2))
    .build();

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://external-api.com/data"))
    .timeout(Duration.ofSeconds(5))
    .build();

Instant start = Instant.now();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
Instant end = Instant.now();

logger.debug("Call to external-api took {} ms", Duration.between(start, end).toMillis());

A common hidden problem is DNS resolution delay. If your application makes many short-lived connections, the time to resolve the hostname each time can add up. Using a caching resolver or using IP addresses directly in load-balanced scenarios can help. Also, check your connection pool settings. A pool that is too small will lead to threads waiting for a connection, mimicking a slow service.
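To check whether resolution itself is the bottleneck, you can time it directly with the standard library. A small sketch:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsTimer {
    // Measures how long one hostname resolution takes, in milliseconds.
    // Consistently slow results here point at the resolver, not your code.
    public static long resolveMillis(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress address = InetAddress.getByName(host);
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%s -> %s in %d ms%n", host, address.getHostAddress(), elapsed);
        return elapsed;
    }
}
```

Note that the JVM caches successful lookups, so run this against a fresh hostname, or repeatedly, to see the real cost.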

A frustrating category of bugs is when the code is correct, but the environment is wrong. Perhaps a configuration property has a different value in production than in staging. Maybe a library was upgraded with a subtle behavior change. This is “configuration drift.”

A simple defensive practice is to log the important configuration settings and their sources on startup.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;

@Component
public class ConfigLogger {

    private static final Logger logger = LoggerFactory.getLogger(ConfigLogger.class);

    @Value("${database.url:Not Found}")
    private String dbUrl;

    @Value("${cache.enabled:false}")
    private boolean cacheEnabled;

    @PostConstruct
    public void logConfig() {
        logger.info("Application starting with configuration:");
        logger.info("  database.url = {}", dbUrl);
        logger.info("  cache.enabled = {}", cacheEnabled);
        logger.info("  Java version = {}", System.getProperty("java.version"));
    }
}

Now, if there’s a problem connecting to the database, the first place to check is the log to see what URL the application actually tried to use. You should also have a way to compare dependencies. The output of mvn dependency:tree or examining the BOOT-INF/lib directory in a Spring Boot jar can reveal if a different version of a critical library like Jackson or Netty has been pulled in.

Sometimes, despite all the tools, the cause remains elusive. You have a hypothesis, but you need to test it in the live environment. This is where you must experiment, but you must do it safely and incrementally.

Feature flags are a powerful tool here. You can deploy code that contains a new, potentially better-performing algorithm alongside the old one, but it’s disabled by default.

public class OrderProcessor {
    private final FeatureFlagService flags;

    public Result process(Order order) {
        // Evaluate the flag once; keyed on the user ID for a gradual rollout
        boolean useNew = flags.isEnabled("new-processing-algo", order.getUserId());
        Result result;
        if (useNew) {
            result = newAlgorithm(order);
            metrics.counter("algorithm", "new").increment();
        } else {
            result = oldAlgorithm(order);
            metrics.counter("algorithm", "old").increment();
        }
        // Log which path handled the order, for later analysis
        logger.debug("Processed order {} with algorithm {}", order.getId(), useNew ? "new" : "old");
        return result;
    }
}

Now, you can enable the new algorithm for 1% of users, or just for your own test account. You can monitor the logs and metrics for errors and performance differences. If something goes wrong, you turn the flag off. This allows for direct A/B testing of fixes and improvements in the real production environment with real load and data, but with a safety net.

Another form of experiment is targeted logging. You can temporarily increase the log level to DEBUG for a single, problematic user session or for a specific component, to get a flood of detailed information about that one case without drowning the entire system in log data.
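A sketch of that idea using java.util.logging from the standard library (Logback offers the same capability through its own Logger#setLevel, and Spring Boot exposes it over the actuator’s loggers endpoint):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelSwitch {
    // Raise verbosity for one component at runtime without a restart.
    // FINE is java.util.logging's rough equivalent of DEBUG.
    public static void enableDebug(String loggerName) {
        Logger.getLogger(loggerName).setLevel(Level.FINE);
    }
}
```

Remember to turn the level back down afterwards; leaving a hot path on DEBUG can itself become a performance problem.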

All these techniques share a common theme: they are about gathering evidence. Production debugging is detective work. You start with a symptom—high CPU, memory growth, errors. You use your tools—heap dumps, thread dumps, traces, logs, profilers—to collect data. You form a hypothesis: “I think the new caching layer is holding onto objects.” You test it: examine the heap dump for cache references, or turn the cache off with a feature flag for a few users.

The goal is twofold. First, restore normal service as quickly as possible, which might involve a rollback, a restart, or a configuration hotfix. Second, and more importantly, understand the root cause well enough to prevent it from happening again. This often means adding better monitoring, writing a test, or fixing the flawed logic. By approaching the problem calmly and methodically, using these techniques as your toolkit, you can solve even the most puzzling production issues.



