Debugging a problem in a live, production application feels different. You can’t just hit pause and step through the code line by line. The system is running, users are active, and you need to figure out what’s wrong without breaking anything else. Over time, I’ve learned that this kind of debugging is less about writing new code and more about learning to listen to what the application is already telling you. It’s a shift from being a creator to being a detective.
The tools and techniques I rely on are designed to be observational. They let me peek into the running Java Virtual Machine (JVM) to see its internal state—what the threads are doing, what’s filling up memory, and where the CPU time is going. The goal is always the same: move from a vague symptom like “the app is slow” or “it crashed” to a specific line of code or a configuration setting that’s causing the trouble.
Let’s start with one of the most immediate tools: the thread dump. When your application freezes, becomes unresponsive, or starts using 100% of a CPU core, a thread dump is often the first thing I reach for. It’s a snapshot that shows me what every single thread in the JVM is doing at that exact moment.
You can capture one easily. Find your application’s process ID (PID) and use a simple command.
jcmd <pid> Thread.print > dump.txt
Opening that dump.txt file can be overwhelming at first. You’ll see a list of dozens of threads with long stack traces. My eyes go straight to the thread states. I look for threads that are BLOCKED waiting on a lock held by another thread—and if two threads are each blocked on a lock the other holds, that’s a classic deadlock. I also scan for many threads stuck in RUNNABLE in the same method; that’s a hot spot consuming CPU. The stack trace tells the story. If twenty threads are all stuck logging to the same file because the disk is full, the stack trace will show them all in the FileOutputStream.write method. It turns a mystery into a clear, actionable fact.
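Before reading every stack trace, I like to get a quick tally of thread states. This is a small triage helper, not any standard tool—it just counts the `java.lang.Thread.State:` lines that `jcmd` emits in the dump:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ThreadStateSummary {
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    // Counts threads per state, e.g. {BLOCKED=3, RUNNABLE=12, WAITING=40}.
    static Map<String, Long> summarize(List<String> dumpLines) {
        return dumpLines.stream()
                .map(STATE::matcher)
                .filter(Matcher::find)
                .collect(Collectors.groupingBy(m -> m.group(1),
                        TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(summarize(Files.readAllLines(Paths.get(args[0]))));
    }
}
```

If the BLOCKED count is high, I go hunting for the lock owner; if RUNNABLE dominates, I look for the hot method.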
While thread dumps explain CPU and hangs, memory problems require a different snapshot: the heap dump. If your application is failing with OutOfMemoryError, a heap dump is your forensic evidence. It’s a complete picture of every object in memory and what references it.
Triggering one is straightforward.
jcmd <pid> GC.heap_dump /home/user/app_dump.hprof
The real work begins when you open that .hprof file in a tool like Eclipse Memory Analyzer (MAT). You’re not just looking for what’s big; you’re looking for what shouldn’t be there. I often start with the “Leak Suspects” report MAT generates. It’s surprisingly accurate. One time, it immediately pointed to a cached Map that was supposed to hold 100 entries but was growing boundlessly because the cache-eviction logic had a bug. The “Dominator Tree” view is also powerful. It shows you which objects are responsible for keeping large chunks of memory alive. Finding that a single, forgotten static list was holding onto thousands of old request objects is a satisfying “aha!” moment.
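The fix for that unbounded cache was to make eviction impossible to forget. The article doesn’t show the original code, but a minimal sketch of a size-bounded LRU cache using the JDK’s own `LinkedHashMap` looks like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal LRU cache: removeEldestEntry evicts the oldest entry once the
// size cap is exceeded, so the map can never grow without bound.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // access-order: true gives LRU behavior
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

The eviction rule lives inside the data structure itself, so no calling code path can accidentally skip it—exactly the class of bug MAT caught.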
Memory issues aren’t just about sudden crashes. Slow, progressive degradation often shows up in the garbage collection logs. These logs are a chronicle of the JVM’s housekeeping efforts. By enabling them, you get a history of every minor and major garbage collection.
I add these flags to the application startup.
java -Xlog:gc*,gc+age=trace:file=gc.log:time,uptime:filecount=10,filesize=10m -jar myapp.jar
Reading the raw log is possible, but I usually feed it to a tool like GCeasy or GarbageCat. They visualize the data. What I’m looking for are trends. Is the old generation growing steadily between full GCs, indicating a slow leak? Are the GC pauses becoming longer and more frequent as the application runs? A pattern I’ve seen is “promotion failure,” where objects are moving from the young to the old generation too quickly, leading to inefficient collections. The logs tell you not just that GC is happening, but why it’s struggling.
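When I can’t reach those tools, a rough trend check is easy to script. This sketch assumes the unified-logging line format from the flags above (a trailing `N.NNNms` pause duration on each `Pause` line); the `pausesGrowing` heuristic and its 1.5× threshold are my own choices, not anything standard:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseTrend {
    private static final Pattern PAUSE = Pattern.compile("Pause.*?([\\d.]+)ms\\s*$");

    // Extracts pause durations (ms) from unified-logging GC lines.
    static List<Double> pauses(List<String> logLines) {
        List<Double> out = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) out.add(Double.parseDouble(m.group(1)));
        }
        return out;
    }

    // Crude trend check: are the last n pauses noticeably longer, on
    // average, than the first n? A steady rise suggests GC is struggling.
    static boolean pausesGrowing(List<Double> p, int n) {
        if (p.size() < 2 * n) return false;
        double first = p.subList(0, n).stream().mapToDouble(d -> d).average().orElse(0);
        double last = p.subList(p.size() - n, p.size()).stream().mapToDouble(d -> d).average().orElse(0);
        return last > first * 1.5;
    }
}
```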
For a real-time view, I use JMX. It’s like having a dashboard built into the JVM itself. You can connect with a GUI tool like JConsole or VisualVM, or query it from the command line.
jcmd <pid> GC.class_histogram
This gives me an on-demand histogram of live objects. In a GUI, I monitor charts for heap usage and thread count. Seeing a sawtooth pattern for heap usage is normal—it goes up as objects are created and down after GC. Seeing a staircase pattern, where the baseline after each GC is higher than the last, is a red flag for a memory leak. JMX also exposes metrics specific to your application server, like active sessions or JDBC connection pool usage, which can be the direct link between a system metric and a business process.
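The same numbers JConsole charts are available in-process through the platform MXBeans that back JMX, which is handy for wiring them into your own health endpoint:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class JvmStats {
    // Reads heap usage and thread counts from the platform MXBeans —
    // the same data JConsole and VisualVM display over JMX.
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long usedHeapMb = memory.getHeapMemoryUsage().getUsed() / (1024 * 1024);
        System.out.printf("heap used: %d MB, live threads: %d, peak threads: %d%n",
                usedHeapMb,
                threads.getThreadCount(),
                threads.getPeakThreadCount());
    }
}
```

Sampling these periodically and exporting them is how you turn the sawtooth-versus-staircase judgment call into an alert.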
When the issue is pure speed—high CPU without a crash—I need to see which specific methods are burning cycles. This is where a profiler comes in. My preferred tool for production is async-profiler because it has a very low overhead. You attach it to the running process, let it sample the call stacks for a minute, and it generates a flame graph.
./profiler.sh -d 60 -f /tmp/flamegraph.svg <pid>
A flame graph is a visual masterpiece of debugging. The width of each box represents how often that method was on the stack during samples. I look for the widest bars at the top of the graph. One time, the widest bar was a seemingly innocuous String.toUpperCase() call inside a loop that was processing a massive file. The fix was to move the case conversion outside the loop. The profile showed me the exact line of code to fix.
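The article doesn’t show that original loop, but the shape of the bug and the fix looks like this (an illustrative reconstruction, assuming the call was normalizing a keyword on every iteration):

```java
import java.util.List;

public class CaseFix {
    // Before: toUpperCase() runs once per line — in the flame graph this
    // showed up as the widest bar, inside the loop's stack frames.
    static int countMatchesSlow(List<String> lines, String keyword) {
        int count = 0;
        for (String line : lines) {
            if (line.contains(keyword.toUpperCase())) count++;
        }
        return count;
    }

    // After: hoist the loop-invariant work out; the result is identical,
    // but the conversion happens once instead of once per line.
    static int countMatchesFast(List<String> lines, String keyword) {
        String upper = keyword.toUpperCase();
        int count = 0;
        for (String line : lines) {
            if (line.contains(upper)) count++;
        }
        return count;
    }
}
```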
The JVM uses memory outside of the Java heap for things like thread stacks, compiled code, and direct buffers. A leak here can cause native out-of-memory errors even if your heap looks fine. Native Memory Tracking (NMT) sheds light on this.
Enable it at startup and query it later.
java -XX:NativeMemoryTracking=detail -jar app.jar
# ... later, when investigating ...
jcmd <pid> VM.native_memory detail
The report breaks down memory by category. I once tracked down a problem where “Thread” memory was constantly increasing. The NMT report confirmed it. The issue was a misconfigured thread pool that was creating new threads but never shutting them down. Without NMT, we would have been looking at heap dumps forever, finding nothing.
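The leaky pattern and its fix look roughly like this (a sketch with hypothetical names, not the actual code from that incident):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolFix {
    // Leaky pattern: a fresh executor per task that is never shut down
    // keeps its worker thread alive forever — NMT's "Thread" category
    // grows with every call.
    static void leaky(Runnable task) {
        Executors.newSingleThreadExecutor().submit(task); // no shutdown()
    }

    // Fix: one shared, bounded pool for the life of the application.
    static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    static Future<?> submit(Runnable task) {
        return POOL.submit(task);
    }
}
```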
All these JVM tools are powerful, but they’re generic. Your application logs are your custom instrumentation. The difference between a good log and a bad log is context. A log message that says “Error processing request” is useless. A log message that says “Error processing OrderId=44521 for UserId=7812: Failed to charge credit card ending in 1234, status=INSUFFICIENT_FUNDS” tells the whole story.
I make it a practice to log in a structured way, almost like emitting events.
// keyValue here is a static import of StructuredArguments.keyValue
// (from the logstash-logback-encoder library)
log.error("payment_failed",
        keyValue("orderId", order.getId()),
        keyValue("userId", order.getUserId()),
        keyValue("errorCode", e.getCode()),
        keyValue("reason", e.getMessage()));
This structure allows me to ship logs to a system like Elasticsearch and instantly query for all failures for a specific user or a specific error code. I also ensure every log entry has a correlation ID—a unique identifier passed through every service call. When a user reports a problem, I can find that single ID and see the complete journey of their request across every microservice, which is invaluable in a distributed system.
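Logging frameworks carry the correlation ID for you via MDC (e.g. SLF4J’s `MDC.put("correlationId", ...)`), which stamps it onto every log line on that thread. This stdlib-only sketch shows the underlying idea with a `ThreadLocal`; the method names are mine:

```java
import java.util.UUID;

// Sketch of correlation-ID propagation: set once at the edge of the
// system (e.g. in a servlet filter), readable anywhere on the thread.
public class CorrelationId {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Reuse an incoming header's ID if present, so the same ID follows
    // the request across service boundaries; otherwise mint a new one.
    public static String start(String incoming) {
        String id = (incoming != null) ? incoming : UUID.randomUUID().toString();
        CURRENT.set(id);
        return id;
    }

    public static String get() {
        return CURRENT.get();
    }

    // Clear at the end of the request to avoid leaking IDs across
    // pooled-thread reuse.
    public static void clear() {
        CURRENT.remove();
    }
}
```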
For the most elusive bugs—the ones that happen “sometimes” and are impossible to reproduce—I’ve found recording and replaying to be a lifesaver. The idea is to capture the exact inputs and state when the bug occurs in production.
You can implement a simple recording mechanism.
if (isSamplingRequest()) {
    RequestSnapshot snapshot = new RequestSnapshot(request, sessionData, System.nanoTime());
    snapshotArchive.save(snapshot);
}
Later, you can take that snapshot, load it into a development environment running the exact same code version, and replay it under a debugger. It turns an intermittent production ghost into a reproducible, debuggable test case. I’ve used this to solve race conditions that only happened under heavy load at 2 PM on weekdays.
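The archive behind that `save` call can be as simple as Java serialization to a file. A minimal sketch, assuming the captured state can be represented as serializable values (real request objects usually need converting to a plain DTO first):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

public class SnapshotArchive {
    // A simplified, serializable stand-in for the request state captured
    // in production; field names here are illustrative.
    public record RequestSnapshot(String requestBody, String sessionData,
                                  long capturedAtNanos) implements Serializable {}

    public static void save(RequestSnapshot snapshot, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(snapshot);
        }
    }

    // In the dev environment: reload the snapshot and replay it under a debugger.
    public static RequestSnapshot load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (RequestSnapshot) in.readObject();
        }
    }
}
```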
Sometimes, the problem isn’t in your Java code at all. It’s a leak of underlying system resources: file handles, database connections, or network sockets. The application will start throwing strange “too many open files” or “cannot get connection from pool” errors.
On a Linux server, you can check the current usage.
ls /proc/<pid>/fd | wc -l
I monitor this number over time. If it climbs steadily and never drops, something isn’t being closed. In the code, this is where try-with-resources is non-negotiable.
// This ensures the Connection is closed, even if an exception is thrown.
try (Connection conn = dataSource.getConnection();
     PreparedStatement stmt = conn.prepareStatement(sql)) {
    // ... use statement
}
For connection pools, setting a sensible maxLifetime and a leakDetectionThreshold helps identify code paths that borrow connections but forget to return them.
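Those two settings are HikariCP’s names (other pools have equivalents); a hedged example of what I might set, with values that are starting points rather than recommendations:

```properties
# HikariCP pool settings (illustrative values)
maxLifetime=1800000            # 30 min: retire connections before the DB or firewall kills them
leakDetectionThreshold=60000   # log a stack trace when a connection is held longer than 60 s
```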
Finally, the most powerful technique is correlation. It’s about linking the “what” of a system event to the “why” of a business event. A spike in GC activity is just a graph on a dashboard. But if I can see that spike happened exactly when the nightly financial reconciliation job started, I have a direct link.
I instrument key business transactions to emit metrics.
public class OrderProcessor {
    public void process(Order order) {
        Timer.Sample sample = Timer.start();
        try {
            // ... process order
        } finally {
            sample.stop(Metrics.globalRegistry.timer("order.processing.time",
                    "customerTier", order.getCustomerTier()));
        }
    }
}
Now, in my monitoring dashboard, I can overlay the order.processing.time metric (and its 99th percentile) with the JVM’s GC pause time. If they jump together, I know the performance problem is affecting real users doing a specific thing, and I know where to start looking. Maybe the reconciliation job loads a huge dataset, causing memory churn and longer GCs, which in turn slows down every other transaction.
Putting it all together, debugging in production is a layered process. You start with the symptom—high CPU, out of memory, unresponsiveness. You use a thread dump, heap dump, or profiler to get a snapshot of the system’s internals. You enhance your logs and metrics to provide the business context. You use tools like NMT to look beyond the heap. Over time, you build a comprehensive picture. The key is to have these tools ready and know when to use each one. It turns a high-pressure outage into a systematic investigation, where each command you run brings you closer to a solution. It’s not magic; it’s just listening carefully to what your application is already trying to say.