I want to share some practical ways to find and fix problems in Java applications that are already live and serving users. This kind of debugging is different from working on your local machine. You can’t just pause the program or restart it easily. You need to use specific tools and methods to see what’s happening inside the application while it’s running.
Let’s start with a common problem: an application that becomes slow or stops responding entirely. When this happens, the first thing I often do is take a thread dump. Think of a thread dump as a snapshot of every single task your application is trying to do at that exact moment. Each task is a “thread.” The dump shows you what each thread is doing and where it might be stuck.
You can capture one easily using command-line tools that come with the Java Development Kit. If you know the process ID of your application, you can run a simple command.
jcmd <your_pid_here> Thread.print > /tmp/thread_dump.txt
After you have this file, open it. You’ll see a list of threads with names and states like RUNNABLE, BLOCKED, or WAITING. The most important part is the stack trace for each thread—it’s the list of method calls that led to its current state. I look for patterns. Do many threads show they are BLOCKED, waiting for the same lock or resource? That points to heavy contention, where threads are lining up and slowing each other down. If I see many threads in a RUNNABLE state, all stuck in the same bit of code, it might be a tight, endless loop consuming CPU.
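Before reading individual stack traces, I like to summarize the states to see the big picture. A minimal sketch of that triage step; the three-line dump below is synthetic, standing in for a real file captured with jcmd:

```shell
# Synthetic stand-in for a real dump captured with: jcmd <pid> Thread.print
cat > /tmp/demo_dump.txt <<'EOF'
"worker-1" #12 prio=5 ... java.lang.Thread.State: BLOCKED (on object monitor)
"worker-2" #13 prio=5 ... java.lang.Thread.State: BLOCKED (on object monitor)
"worker-3" #14 prio=5 ... java.lang.Thread.State: RUNNABLE
EOF

# Count threads per state; a spike in BLOCKED hints at lock contention
grep -o 'java.lang.Thread.State: [A-Z_]*' /tmp/demo_dump.txt | sort | uniq -c
```

On a real dump with hundreds of threads, this one-liner immediately tells you whether you are looking at a contention problem (many BLOCKED) or a CPU problem (many RUNNABLE).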
One tip I always follow is to never rely on a single thread dump. The real story is in the change, or lack of it. I take three or four dumps, spaced about ten seconds apart. Then I compare them. Threads that are in the exact same state, stuck on the exact same line of code across all dumps, are the ones causing the trouble. They are frozen, and they’re likely freezing your application.
Another frequent issue in production is the application gradually using more and more memory until it becomes sluggish and eventually crashes with an OutOfMemoryError. This is often a memory leak. To investigate, you need a heap dump. A heap dump is a complete snapshot of every object living in your application’s memory at a specific time.
Triggering a dump is straightforward. You can use jcmd again, which is my preferred tool as it’s generally safe for production.
jcmd <your_pid_here> GC.heap_dump /tmp/heap.hprof
This command creates a file called heap.hprof. You can’t analyze this raw file with a text editor. You need a tool like Eclipse Memory Analyzer (MAT). When you open the dump in MAT, it can feel overwhelming at first. I start with the built-in reports. The “Leak Suspects” report is incredibly useful—it does automated analysis and often points you directly at the problem.
For a more manual approach, I use the “Dominator Tree” view. This shows you which objects are responsible for holding the largest chunks of memory in place. The chain of references from these big “dominator” objects down to the actual data is what you need to see. A classic pattern I’ve seen is a static HashMap used as a cache. Over time, entries are added but never removed. Even if the business logic thinks the data is expired, the reference from the map keeps every cached object alive, preventing the garbage collector from doing its job.
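That pattern is easy to reproduce. Here is a minimal, self-contained sketch of the leak; the class name and payload sizes are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class LeakyCache {
    // The classic leak: a static map that only grows. Even when the business
    // logic considers an entry expired, this reference keeps it reachable,
    // so the garbage collector can never reclaim it.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    static void handleRequest(String sessionId) {
        CACHE.put(sessionId, new byte[1024]); // added on every request...
        // ...and never removed on any code path.
    }

    static int size() {
        return CACHE.size();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            handleRequest("session-" + i);
        }
        // In MAT's Dominator Tree, this map would dominate roughly 10 MB of byte[]s.
        System.out.println("Entries retained: " + size());
    }
}
```

In MAT, the map shows up at the top of the Dominator Tree, and the path from the static field down to the byte arrays is exactly the reference chain you need to break, typically by adding eviction or switching to a bounded cache.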
To understand memory pressure over time, you must look at the Garbage Collector’s own logs. The GC is the janitor of your application’s memory, and its work log tells you how hard it’s working. Modern Java versions use a unified logging system that is very powerful.
You enable it with a JVM startup argument.
java -Xlog:gc*,gc+heap=debug:file=gc.log:time,uptimemillis,level,tags -jar myapp.jar
This creates a gc.log file. Reading it line by line teaches you the rhythm of your application. You see “Young GC” events happening frequently—these are usually fast and normal. The red flag is the “Full GC” event. This is when the entire application can pause for a noticeable amount of time.
I look for two things. First, the duration of Full GC pauses. If they start taking multiple seconds, your users will feel it. Second, I look at the memory graph within the log. If the “old generation” memory usage climbs steadily before each Full GC and the Full GC only reclaims a small amount, that’s the fingerprint of a memory leak. The heap is filling up with objects that can’t be cleaned.
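A quick way to pull those pauses out of the log; the two lines below are synthetic but follow the shape unified GC logging produces:

```shell
# Synthetic stand-in for a real gc.log produced by -Xlog:gc*
cat > /tmp/gc_demo.log <<'EOF'
[12.345s][info][gc] GC(10) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 15.204ms
[41.002s][info][gc] GC(11) Pause Full (G1 Compaction Pause) 900M->850M(1024M) 2450.731ms
EOF

# Show Full GC pauses; "900M->850M" reclaiming so little is the leak fingerprint
grep "Pause Full" /tmp/gc_demo.log
```

Run against a real log, this surfaces every long pause along with before/after heap occupancy, so you can spot both symptoms (long pauses, poor reclamation) in a single pass.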
For continuous, low-cost visibility into a running application, the JDK Flight Recorder is a game-changer. It’s a profiling tool built right into the JVM. You can have it running all the time in production with a minimal performance hit, which is something traditional profilers can’t do.
You start it with a command-line option.
java -XX:StartFlightRecording=disk=true,maxsize=1g,maxage=24h,name=MyAppRecording -jar myapp.jar
This starts a recording that keeps the last 24 hours of data in a 1-gigabyte rolling buffer on disk. When an incident occurs—say, a latency spike at 2:00 AM—you can extract just the relevant period.
jcmd <your_pid_here> JFR.dump name=MyAppRecording filename=/tmp/incident_2am.jfr
You then open the .jfr file in JDK Mission Control. The tool provides an automated analysis tab that highlights issues like high lock contention, methods that are unusually slow (“hot methods”), and what parts of the code are allocating the most memory. The flame graph visualization is particularly effective. It shows you a picture of CPU usage where the width of each box represents time spent. You can instantly see the “hottest” stack trace, the code path that was consuming the most CPU during your problem window.
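Beyond the JVM's built-in events, you can emit your own with the jdk.jfr API (JDK 11+), and they show up in Mission Control alongside everything else. A minimal sketch; the event name and fields here are invented:

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

public class JfrEventDemo {
    // A custom event: JFR records its duration between begin() and end().
    @Name("com.myapp.OrderProcessed")
    @Label("Order Processed")
    static class OrderProcessedEvent extends Event {
        @Label("Order ID") String orderId;
        @Label("Item Count") int itemCount;
    }

    static boolean recordOrder(String orderId, int items) {
        OrderProcessedEvent e = new OrderProcessedEvent();
        e.begin();          // start the event's clock
        e.orderId = orderId;
        e.itemCount = items;
        e.end();            // stop the clock
        e.commit();         // written to the recording only if one is active; a no-op otherwise
        return true;
    }

    public static void main(String[] args) {
        System.out.println("recorded=" + recordOrder("order-42", 3));
    }
}
```

Because commit() is effectively free when no recording is active, you can leave this instrumentation in production code permanently.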
When your application is not a single monolith but a collection of microservices, debugging gets a new dimension. A request might start at a user’s browser, hit a gateway, move to an order service, then to an inventory service, and finally to a payment processor. If the request is slow, which link in the chain is the problem? This is where distributed tracing comes in.
Tools like OpenTelemetry provide libraries to instrument your code. The core idea is to generate a unique trace ID at the very beginning of a request and carry that ID through every subsequent call.
Here’s a simplified example of how you might manually propagate a trace from one service to another over HTTP.
// In the first service
Span span = tracer.spanBuilder("processOrder").startSpan();
String traceId = span.getSpanContext().getTraceId();
String spanId = span.getSpanContext().getSpanId();
try (Scope scope = span.makeCurrent()) {
    // Build a request to the inventory service, carrying the W3C traceparent header
    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://inventory-service/check"))
            .header("traceparent", "00-" + traceId + "-" + spanId + "-01") // version-traceId-spanId-flags
            .build();
    // Send the request...
    HttpClient.newHttpClient().send(request, BodyHandlers.ofString());
} finally {
    span.end();
}
The receiving service extracts this header and continues the trace. Once all this data is sent to a tracing backend like Jaeger or Zipkin, you get a visual timeline. You can see a single request as a horizontal bar stretching across all the services it touched. The bar for a slow database call will be noticeably longer. You can instantly see if a delay is in your service logic or in a downstream call, and if a failure in one service caused a cascade of failures in others.
Tracing is powerful, but it’s complemented by logging. The challenge with logs in a distributed system is correlation. You might have 50 log lines across 5 services for one failed request. How do you piece them together? The answer is a correlation ID. It’s similar to a trace ID but is specifically for stitching log messages together.
A common way to implement this is with a Servlet Filter or similar interceptor that runs at the start of every web request.
public class CorrelationFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;
        // Get existing ID or generate a new one
        String correlationId = request.getHeader("X-Correlation-ID");
        if (correlationId == null || correlationId.isEmpty()) {
            correlationId = UUID.randomUUID().toString();
        }
        // Put it into a thread-local context (MDC is common in log frameworks)
        MDC.put("correlationId", correlationId);
        // Send it back in the response header for the client
        response.setHeader("X-Correlation-ID", correlationId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear(); // Clean up so pooled threads don't carry stale IDs
        }
    }
}
Then, in your logging configuration (like logback.xml), you include this correlationId from the MDC in every log pattern.
<pattern>%d{ISO8601} [%thread] [%X{correlationId}] %-5level %logger{36} - %msg%n</pattern>
Now, every log message from a single request has the same unique ID. When a user reports an error, you can take their provided correlation ID, search your centralized log store, and instantly see the complete story of their request from the front-end API call through every internal service and database query. It turns a haystack of logs into a clear, linear narrative.
Some of the most frustrating bugs are concurrency issues—race conditions and deadlocks. They happen intermittently, are nearly impossible to reproduce on demand, and often vanish when you try to debug them. In production, your main evidence for these is, again, the thread dump.
The JVM is helpful here. When it detects a deadlock, it will usually print a clear section at the end of the thread dump titled “Found one Java-level deadlock.” It will then list the threads involved and the locks each is holding and waiting for. It looks like this: “Thread-1” holds lock A and is waiting for lock B, while “Thread-2” holds lock B and is waiting for lock A. It’s a circular dependency that will never resolve.
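You can reproduce and detect exactly that situation programmatically with ThreadMXBean, which performs the same check jstack and jcmd run. A self-contained sketch; the thread names and sleep durations are illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockDemo {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    static int deadlockedThreadCount() throws InterruptedException {
        Thread t1 = new Thread(() -> {
            synchronized (LOCK_A) {          // holds A...
                sleepQuietly(100);
                synchronized (LOCK_B) { }    // ...waits for B
            }
        }, "Thread-1");
        Thread t2 = new Thread(() -> {
            synchronized (LOCK_B) {          // holds B...
                sleepQuietly(100);
                synchronized (LOCK_A) { }    // ...waits for A
            }
        }, "Thread-2");
        t1.setDaemon(true);                  // daemons, so the JVM can still exit
        t2.setDaemon(true);
        t1.start();
        t2.start();
        Thread.sleep(500);                   // give the threads time to deadlock

        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads(); // null when no deadlock exists
        return ids == null ? 0 : ids.length;
    }

    private static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Deadlocked threads: " + deadlockedThreadCount());
    }
}
```

Polling findDeadlockedThreads() from a health check is a cheap way to page someone the moment a deadlock forms, instead of waiting for user reports.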
For race conditions, where two threads conflict over data in a non-deterministic way, thread dumps are less directly helpful because the problem is about timing. This is where testing frameworks like jcstress (Java Concurrency Stress) are invaluable during development. They help you formally test the thread-safety of your code.
@JCStressTest
@Outcome(id = "1, 2", expect = Expect.ACCEPTABLE, desc = "actor1 incremented first.")
@Outcome(id = "2, 1", expect = Expect.ACCEPTABLE, desc = "actor2 incremented first.")
@Outcome(id = "1, 1", expect = Expect.FORBIDDEN, desc = "Lost update: both threads read 0.")
@Outcome(expect = Expect.FORBIDDEN, desc = "Other results are broken.")
@State
public class CounterTest {
    private int counter;

    @Actor
    public void actor1(II_Result r) {
        r.r1 = ++counter; // Non-atomic operation: read, increment, write
    }

    @Actor
    public void actor2(II_Result r) {
        r.r2 = ++counter;
    }
}
This test will hammer the non-atomic ++counter operation from two threads across millions of interleavings. The @Outcome annotations define which result pairs are valid: one actor must see 1 and the other 2. If both actors ever observe 1—meaning both read zero before either wrote—an increment was lost, and you’ve found a race condition. Finding these issues in a controlled test is far better than discovering them in production.
Class loading problems can bring an application to a halt during startup or cause mysterious NoClassDefFoundError exceptions later. These often stem from dependency conflicts—two different versions of the same library in your classpath.
A first line of inquiry is to ask the JVM to tell you exactly what it’s loading and from where.
java -verbose:class -jar myapp.jar 2>&1 | grep "com.example.MyProblemClass"
This will show you the jar file from which MyProblemClass was loaded. If you see it loaded from an unexpected or old library, you’ve found a clue. For Maven projects, the mvn dependency:tree command is essential. It draws a map of all your dependencies and their transitive dependencies. Look for the same library artifact appearing multiple times with different versions. Maven’s “nearest definition” rule will pick one, and it might not be the one you expect.
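From inside the JVM, you can also ask a class where it was loaded from, which is handy when restarting with -verbose:class isn’t an option. A small sketch:

```java
public class WhereFrom {
    static String locationOf(Class<?> c) {
        // CodeSource reveals the jar or directory a class came from;
        // it is null for classes loaded by the bootstrap loader (e.g. java.lang.*).
        var source = c.getProtectionDomain().getCodeSource();
        return source == null ? "bootstrap/platform loader" : source.getLocation().toString();
    }

    public static void main(String[] args) {
        System.out.println(locationOf(String.class));    // JDK class: no code source
        System.out.println(locationOf(WhereFrom.class)); // your jar or classes directory
    }
}
```

Logging locationOf(SuspectClass.class) at startup settles once and for all which version of a conflicting library actually won.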
In the modern era of Java with modules (module-info.java), a different class of problem emerges: a required package might not be exported by its module or your module might not declare a required dependency. The error messages here can be explicit, pointing you directly to the missing requires statement.
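For context, here is a hypothetical module descriptor showing the two declarations involved; the module and package names are invented, and the error text is paraphrased from the general shape of these messages:

```java
// module-info.java (hypothetical). Forgetting the `requires` line below
// produces errors along the lines of: "module com.example.myapp does not
// read module java.sql".
module com.example.myapp {
    requires java.sql;             // declare every module whose packages you use
    exports com.example.myapp.api; // only exported packages are visible to other modules
}
```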
Slow applications are often slow because they are waiting, not computing. They are waiting on network calls to databases, APIs, or other services. The first line of defense is proper timeout configuration. An HTTP client without timeouts can wait forever, causing your threads to pile up and stall.
HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(3))
        .build();

HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://inventory-service/check"))
        .timeout(Duration.ofSeconds(10)) // Crucial for production
        .build();

Note that the JDK’s HttpClient has no read timeout on the client builder; the response timeout is set per request with HttpRequest.Builder.timeout().
If timeouts are firing, you need to know why the external call is slow. Is it the network, the remote service, or something else? Inside the JVM, JDK Flight Recorder can again help by recording socket read/write events. Externally, you can use system-level tools. On Linux, a command like tcpdump can capture network traffic, or ss can show socket statistics.
Don’t forget DNS. A surprising number of performance issues stem from DNS lookups. If your application makes many calls to different hosts and doesn’t cache DNS results, it might be constantly asking the operating system to resolve names, which can introduce latency. Using a caching resolver or ensuring your connection pools reuse established connections can mitigate this.
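On the JVM side, name-lookup caching is controlled by two security properties. A small sketch of tuning them programmatically; they can also be set in the JDK’s java.security configuration file:

```java
import java.security.Security;

public class DnsCacheTuning {
    public static void main(String[] args) {
        // Seconds to cache successful lookups (-1 = cache forever). Long-lived
        // services calling DNS-load-balanced endpoints usually want a finite value.
        Security.setProperty("networkaddress.cache.ttl", "60");
        // Seconds to cache failed lookups; keep this short so a transient
        // DNS outage doesn't linger as persistent errors.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
        System.out.println("ttl=" + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

These must be set early, before the first lookup, which is why many teams prefer putting them in the security configuration file rather than in code.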
Finally, the Java Management Extensions (JMX) provide a standardized way to monitor and manage the JVM. Many application servers, frameworks, and even your own code can expose metrics and operations through JMX.
You can connect to a running JVM process using a tool like jconsole or VisualVM.
jconsole <your_pid_here>
Once connected, you navigate through the MBean tree. Key areas include:
java.lang:type=Memory: Check heap and non-heap memory usage.
java.lang:type=Threading: See live thread count and peak thread count.
java.lang:type=OperatingSystem: View system CPU load.
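The same platform MBeans are reachable from code via java.lang.management, which is useful for wiring them into your own health endpoint. For example:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class JvmStats {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long heapUsed = memory.getHeapMemoryUsage().getUsed(); // bytes currently used
        long heapMax = memory.getHeapMemoryUsage().getMax();   // -1 if undefined
        int liveThreads = threads.getThreadCount();
        int peakThreads = threads.getPeakThreadCount();

        System.out.println("heapUsed=" + heapUsed
                + " heapMax=" + heapMax
                + " liveThreads=" + liveThreads
                + " peakThreads=" + peakThreads);
    }
}
```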
The real power comes from custom MBeans. You can instrument your own code to expose critical business metrics or operational controls.
// Uses Spring's JMX annotations (@ManagedResource and friends); plain JMX
// would instead define an MBean interface and register it with the MBeanServer.
@ManagedResource(objectName = "com.myapp:type=Cache,name=UserCache")
public class UserCacheMBean {
    private final Cache cache;

    public UserCacheMBean(Cache cache) {
        this.cache = cache;
    }

    @ManagedAttribute
    public long getSize() {
        return cache.size();
    }

    @ManagedAttribute
    public long getHitCount() {
        return cache.stats().hitCount();
    }

    @ManagedOperation
    public void clearCache() {
        cache.invalidateAll();
        System.out.println("Cache cleared via JMX operation.");
    }
}
By registering this MBean, you can now monitor your cache’s size and hit rate from your JMX console. More importantly, you can invoke the clearCache() operation remotely. This allows you to perform targeted diagnostics or fixes without a full restart—like clearing a corrupted cache entry that’s causing errors.
These techniques form a toolkit. You won’t use every tool for every problem. A sudden CPU spike calls for thread dumps and JFR. A gradual memory increase points to heap dumps and GC logs. A slow user request in a microservice architecture needs tracing and correlated logs. The key is to understand what each tool reveals and to approach production debugging systematically, starting with the observable symptoms and drilling down to the root cause with the right instrument. It’s the art of understanding a complex system from the outside, using the data it chooses to reveal.