
How to Debug Java Production Issues Without Taking the System Down

Learn how to debug live Java applications using heap dumps, thread analysis, GC logs, and distributed tracing. Diagnose production issues without downtime.


Debugging a problem in a production Java application feels different. You can’t just stop the world, add a breakpoint, and step through the code. The system is live, users are depending on it, and every action carries more weight. Over time, I’ve learned to rely on a set of systematic approaches to find the root cause without making the situation worse. Let’s walk through some of the most effective methods.

The first step is always observation. You notice a symptom—maybe the application is getting slower, using more memory, or throwing errors. Your goal is to move from that symptom to a clear understanding of why it’s happening.

When memory usage climbs steadily until the application crashes with an OutOfMemoryError, you are likely dealing with a memory leak. Objects are being created but not released for garbage collection because something still holds a reference to them. To see this, you need a snapshot of the memory, called a heap dump.

You can get a heap dump in a couple of ways. If the application is still running, you can use the jmap tool from the command line. You’ll need the process ID of your Java application.

jmap -dump:live,format=b,file=heap.hprof 12345

This command tells the JVM to take a live heap dump from process 12345 and save it to a file named heap.hprof. The live option triggers a full garbage collection first, so you only see objects that are still in use. Sometimes, you want the dump to happen automatically when the error occurs. You can start your application with special JVM flags.

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/myapp -jar application.jar

Now, when the OutOfMemoryError strikes, a heap dump file will be created in /var/log/myapp. You have the evidence. The next step is analysis. A raw heap dump file isn’t human-readable. You need a tool like Eclipse Memory Analyzer (MAT) or VisualVM.

When you load the file into MAT, you are presented with a lot of information. I often start with the “Leak Suspects Report.” It gives a high-level summary of the largest objects in memory and what is keeping them alive. The key concept here is the “dominator tree.” An object dominates another if all paths to that other object go through the first one. By looking at the dominator tree, you can quickly find which single object is responsible for holding a large chunk of memory.

For example, you might find that a single HashMap instance, used as a static cache, is dominating 80% of the heap. The report shows the class of the objects it holds—perhaps millions of String keys. This points you directly to the problem: a cache that never expires entries. The fix isn’t in the heap dump, but the dump shows you exactly where to look in your code.
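To make that failure mode concrete, here is a minimal, hypothetical sketch of the pattern MAT often points to: a static map used as a cache, with entries added but never evicted. All names here are illustrative, not taken from any real application.

```java
import java.util.HashMap;
import java.util.Map;

public class SessionCache {
    // A static map lives as long as the class itself: every entry put
    // here stays reachable forever unless something explicitly removes it.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static void put(String sessionId, byte[] data) {
        CACHE.put(sessionId, data); // entries go in...
        // ...but there is no eviction and no expiry: the map only grows
    }

    public static int size() {
        return CACHE.size();
    }
}
```

A common fix is to bound the cache, for example with a `LinkedHashMap` that overrides `removeEldestEntry`, or a caching library that supports size limits and expiry.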

A sudden, sustained spike in CPU usage is another common alarm. The application is working very hard at something. To find out what, you need to see what the threads are doing. A thread dump shows the exact line of code each thread is executing at a single moment in time.

You can capture a thread dump using jstack.

jstack -l 12345 > thread_dump.txt

The -l option includes additional information about locks. Looking at one thread dump is helpful, but it’s a static picture. If a thread is in a fast loop, it might be on a different line of code each time you sample. To see the hot code path, you need to take several dumps, two or three seconds apart, and then compare them.
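A small shell loop makes the repeated sampling easy; this sketch assumes the same example process ID used above.

```shell
PID=12345   # replace with your application's process ID
for i in 1 2 3; do
    jstack -l "$PID" > "thread_dump_$i.txt" || break
    sleep 3
done
```

Comparing the three files side by side shows which threads stay busy in the same place.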

Open the thread_dump.txt file. You’ll see a list of all threads. Look for threads whose state is RUNNABLE. This means they are actively using CPU cycles. Then, examine their stack traces—the list of method calls that led to the current line of code.

If you take three dumps and the same thread is always RUNNABLE and always executing the same method, like com.example.Processor.calculate(), you have a strong lead. That method is likely in a tight loop or performing a very expensive calculation. For an even clearer picture, a sampling profiler like async-profiler is invaluable. It periodically samples where all your threads are executing and builds a statistical profile of where the CPU time is spent, all with very low overhead.

Sometimes, the application doesn’t use CPU; it just stops. Requests hang and eventually time out. This often points to thread contention or a deadlock. A deadlock is a classic problem where two threads are each waiting for a lock the other holds, so neither can proceed.

The good news is the JVM can detect these classic deadlocks. When you run jstack, it will usually print a section at the end if it finds any.

Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x00007f88a4008f00 (object 0x000000076ac44778, a com.example.ResourceA),
  which is held by "Thread-2"
"Thread-2":
  waiting to lock monitor 0x00007f88a4009020 (object 0x000000076ac44790, a com.example.ResourceB),
  which is held by "Thread-1"

The output clearly shows the circular wait. The fix involves changing the order in which locks are acquired or using timeouts with tryLock. Not all stalls are full deadlocks. Sometimes, many threads are just waiting for a single, heavily contended lock. This is high contention. In your thread dump, you’ll see many threads in BLOCKED state, all trying to enter a synchronized block on the same object. The solution might be to reduce the scope of the lock, use a concurrent collection, or employ a read-write lock.
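As a sketch of the tryLock approach to breaking the circular wait, assuming two hypothetical resources each guarded by a ReentrantLock:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TransferService {
    private final ReentrantLock lockA = new ReentrantLock();
    private final ReentrantLock lockB = new ReentrantLock();

    public boolean transfer() {
        try {
            // Try the first lock, then the second, each with a timeout:
            // if the second can't be taken, back off instead of waiting forever.
            if (lockA.tryLock(1, TimeUnit.SECONDS)) {
                try {
                    if (lockB.tryLock(1, TimeUnit.SECONDS)) {
                        try {
                            return true; // both locks held: do the work here
                        } finally {
                            lockB.unlock();
                        }
                    }
                } finally {
                    lockA.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false; // couldn't get both locks; the caller can retry
    }
}
```

Consistent lock ordering — always acquiring lockA before lockB everywhere — removes the circular wait entirely; timeouts are the fallback when you can't guarantee ordering.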

The garbage collector is your silent partner, cleaning up unused memory. When it struggles, your application struggles. Long or frequent garbage collection pauses, called “stop-the-world” events, freeze your application threads. To understand GC behavior, you must enable logging.

Modern JDKs use a unified logging system. Here’s how you can turn on detailed GC logging.

java -Xlog:gc*,gc+heap*,gc+age*:file=gc.log:time:filecount=5,filesize=10m -jar app.jar

This logs all GC events and heap details to a rotating file. After a period of poor performance, you examine gc.log. What are you looking for? Patterns. You might see many “Full GC” events happening in quick succession. A Full GC cleans the entire heap and is typically slow. Frequent Full GCs often mean the “Old Generation” part of the heap is filling up too fast. This could be due to a memory leak, or it could mean your heap is simply too small for the application’s workload.

Conversely, you might see that “Young GC” pauses, which are usually short, are taking several seconds. This suggests a problem with the creation rate of short-lived objects. Perhaps a new feature is allocating huge, temporary arrays in a loop. The GC logs give you the timing and cause; your job is to connect that to a recent code change or increased load.
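Alongside the log file, the JVM exposes collector statistics at runtime through the standard management API. A small sketch for spotting rising GC time between two checks:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Returns total milliseconds spent in GC so far, across all collectors.
    // Sampling this periodically and diffing reveals a rising GC burden.
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }
}
```

This is a coarse signal compared to full GC logs, but it is cheap enough to export as a metric and alert on.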

In today’s world of microservices, a single user request can travel through five or six different applications. If a request is slow, which service is the culprit? This is where distributed tracing shines. Tools like Jaeger or Zipkin work by assigning a unique trace ID to each incoming request at the very edge of your system.

This trace ID is then passed along with every subsequent internal call, whether it’s an HTTP request to another service or a database query. Each service adds its own “span” to the trace, recording the start and end time of its work.

Implementing it involves a small bit of code. Using a library like OpenTelemetry, you might instrument a web controller.

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {

    private final Tracer tracer;
    private final OrderService orderService;

    public OrderController(Tracer tracer, OrderService orderService) {
        this.tracer = tracer;
        this.orderService = orderService;
    }

    @GetMapping("/order/{id}")
    public Order getOrder(@PathVariable String id) {
        Span span = tracer.spanBuilder("getOrder").startSpan();
        try (Scope scope = span.makeCurrent()) {
            // The trace context is now automatically propagated in this scope
            Order order = orderService.findOrder(id);
            span.setAttribute("order.id", id);
            return order;
        } catch (Exception e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

Later, in a dashboard, you can see a visual waterfall diagram of the entire request. You might discover that the 2-second delay is not in your Java code at all, but in a call to a third-party payment service, or in a specific, unoptimized database query. Tracing turns a vague “the system is slow” into “this specific call to Service X is slow.”

Your application is almost certainly writing logs. In production, the volume can be overwhelming. The key is structure. Writing logs as plain lines of text makes them hard to filter and analyze at scale. Instead, log in a structured format like JSON.

Many logging frameworks can do this. Here’s an example with SLF4J and Logback, using the Mapped Diagnostic Context (MDC) to add fields.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class PaymentService {
    private static final Logger logger = LoggerFactory.getLogger(PaymentService.class);

    public void processPayment(String userId, String requestId) {
        // Put contextual information into the MDC
        MDC.put("userId", userId);
        MDC.put("requestId", requestId);
        try {
            logger.info("Starting payment processing");
            boolean paymentFailed = false;
            // ... business logic sets paymentFailed on error ...

            if (paymentFailed) {
                logger.error("Payment failed for user");
            }
        } finally {
            // Always clear the MDC for this thread, even if an exception escapes
            MDC.clear();
        }
    }
}

With a layout configured for JSON, a log entry looks like this:

{
  "timestamp": "2023-10-27T10:15:30.123Z",
  "level": "ERROR",
  "logger": "PaymentService",
  "message": "Payment failed for user",
  "userId": "user-456",
  "requestId": "req-789"
}
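One way to produce output like that — sketched here assuming the third-party logstash-logback-encoder library is on the classpath — is a Logback configuration along these lines:

```xml
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <!-- LogstashEncoder emits each event as one JSON object,
         including MDC entries as top-level fields -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
```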

Now, in your log aggregation system, you can run powerful queries. You can find all errors for a specific userId. You can group errors by the logger field to see which class is throwing the most exceptions. You can track a single requestId across all the microservices it touched. This transforms logs from a text file you grep into a queryable dataset for investigation.

Java Flight Recorder (JFR) is a remarkable tool that comes bundled with the JDK. It is designed to have such a low performance overhead that you can run it continuously in production. JFR records events happening inside the JVM: method calls, garbage collections, file I/O, socket reads, thread parks, and much more.

Starting a recording is simple. You can do it from the command line when launching the app.

java -XX:StartFlightRecording=filename=myrecording.jfr,duration=1h -jar app.jar

This starts a one-hour recording. You can also trigger it on demand from within your code or using tools like jcmd.

jcmd 12345 JFR.start name=MyRecording duration=60s filename=/tmp/trouble.jfr

After an incident, you stop the recording and download the .jfr file. You open it with JDK Mission Control (JMC). The interface lets you explore all the recorded events on a timeline. You can see a spike in “Java Blocking” events that corresponds to the time users reported slowness. Drilling down, you find that the blocking was caused by a particular ReentrantLock. You can see the stack trace of the threads holding and waiting for that lock. JFR provides a correlated, time-synchronized view of many different aspects of your application’s behavior, which is incredibly powerful for diagnosing complex, intermittent issues.
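JFR can also record your own application events through the jdk.jfr API, so business operations appear on the same timeline as the JVM's built-in events. A minimal sketch, with illustrative names:

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

public class JfrExample {

    // A custom event: when a recording is active, instances show up
    // in JMC alongside GC, lock, and I/O events.
    @Name("com.example.PaymentProcessed")
    @Label("Payment Processed")
    static class PaymentEvent extends Event {
        @Label("Order ID")
        String orderId;
    }

    static PaymentEvent record(String orderId) {
        PaymentEvent event = new PaymentEvent();
        event.orderId = orderId;
        event.begin();          // start the event's clock
        // ... the work you want timed ...
        event.commit();         // recorded only if a recording is active
        return event;
    }
}
```

Because commit() is effectively free when no recording is running, this instrumentation can stay in production code permanently.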

Not all problems originate within your JVM. The application communicates with the outside world—databases, other services, APIs. Network issues can cause timeouts, slow responses, and failures.

Basic command-line tools are your first line of defense. From the host or container running your app, you can test connectivity.

# Can I reach the host?
ping database-host

# Is the specific port open?
nc -zv database-host 5432

# What route do packets take?
traceroute database-host

Within your Java application, you can add more insight. Clients like Apache HttpClient and OkHttp can be configured to log connection and request timings; even with the JDK’s built-in HttpClient, you can set explicit timeouts and measure request durations yourself.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(2))
    .build();

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://external-api.com/data"))
    .timeout(Duration.ofSeconds(5))
    .build();

Instant start = Instant.now();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
Instant end = Instant.now();

logger.debug("Call to external-api took {} ms", Duration.between(start, end).toMillis());

A common hidden problem is DNS resolution delay. If your application makes many short-lived connections, the time to resolve the hostname each time can add up. Using a caching resolver or using IP addresses directly in load-balanced scenarios can help. Also, check your connection pool settings. A pool that is too small will lead to threads waiting for a connection, mimicking a slow service.
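To check whether resolution itself is the bottleneck, you can time it directly with the standard library. A small sketch:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsTimer {
    // Measures how long one hostname resolution takes, in milliseconds.
    // Consistently slow results here point at the resolver, not your code.
    public static long resolveMillis(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress address = InetAddress.getByName(host);
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%s -> %s in %d ms%n", host, address.getHostAddress(), elapsed);
        return elapsed;
    }
}
```

Note that the JVM caches successful lookups, so run this against a fresh hostname, or repeatedly, to see the real cost.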

A frustrating category of bugs is when the code is correct, but the environment is wrong. Perhaps a configuration property has a different value in production than in staging. Maybe a library was upgraded with a subtle behavior change. This is “configuration drift.”

A simple defensive practice is to log the important configuration settings and their sources on startup.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;

@Component
public class ConfigLogger {

    private static final Logger logger = LoggerFactory.getLogger(ConfigLogger.class);

    @Value("${database.url:Not Found}")
    private String dbUrl;

    @Value("${cache.enabled:false}")
    private boolean cacheEnabled;

    @PostConstruct
    public void logConfig() {
        logger.info("Application starting with configuration:");
        logger.info("  database.url = {}", dbUrl);
        logger.info("  cache.enabled = {}", cacheEnabled);
        logger.info("  Java version = {}", System.getProperty("java.version"));
    }
}

Now, if there’s a problem connecting to the database, the first place to check is the log to see what URL the application actually tried to use. You should also have a way to compare dependencies. The output of mvn dependency:tree or examining the BOOT-INF/lib directory in a Spring Boot jar can reveal if a different version of a critical library like Jackson or Netty has been pulled in.

Sometimes, despite all the tools, the cause remains elusive. You have a hypothesis, but you need to test it in the live environment. This is where you must experiment, but you must do it safely and incrementally.

Feature flags are a powerful tool here. You can deploy code that contains a new, potentially better-performing algorithm alongside the old one, but it’s disabled by default.

public class OrderProcessor {
    private final FeatureFlagService flags;

    public Result process(Order order) {
        // Evaluate the flag once; keyed on the user ID for a gradual rollout
        boolean useNew = flags.isEnabled("new-processing-algo", order.getUserId());
        Result result;
        if (useNew) {
            result = newAlgorithm(order);
            metrics.counter("algorithm", "new").increment();
        } else {
            result = oldAlgorithm(order);
            metrics.counter("algorithm", "old").increment();
        }
        // Log which path handled the order, for later analysis
        logger.debug("Processed order {} with algorithm {}", order.getId(), useNew ? "new" : "old");
        return result;
    }
}

Now, you can enable the new algorithm for 1% of users, or just for your own test account. You can monitor the logs and metrics for errors and performance differences. If something goes wrong, you turn the flag off. This allows for direct A/B testing of fixes and improvements in the real production environment with real load and data, but with a safety net.

Another form of experiment is targeted logging. You can temporarily increase the log level to DEBUG for a single, problematic user session or for a specific component, to get a flood of detailed information about that one case without drowning the entire system in log data.
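A sketch of that idea using java.util.logging from the standard library (Logback offers the same capability through its own Logger#setLevel, and Spring Boot exposes it over the actuator’s loggers endpoint):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelSwitch {
    // Raise verbosity for one component at runtime without a restart.
    // FINE is java.util.logging's rough equivalent of DEBUG.
    public static void enableDebug(String loggerName) {
        Logger.getLogger(loggerName).setLevel(Level.FINE);
    }
}
```

Remember to turn the level back down afterwards; leaving a hot path on DEBUG can itself become a performance problem.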

All these techniques share a common theme: they are about gathering evidence. Production debugging is detective work. You start with a symptom—high CPU, memory growth, errors. You use your tools—heap dumps, thread dumps, traces, logs, profilers—to collect data. You form a hypothesis: “I think the new caching layer is holding onto objects.” You test it: examine the heap dump for cache references, or turn the cache off with a feature flag for a few users.

The goal is twofold. First, restore normal service as quickly as possible, which might involve a rollback, a restart, or a configuration hotfix. Second, and more importantly, understand the root cause well enough to prevent it from happening again. This often means adding better monitoring, writing a test, or fixing the flawed logic. By approaching the problem calmly and methodically, using these techniques as your toolkit, you can solve even the most puzzling production issues.



