When your production application starts behaving strangely, the feeling is unmistakable. The graphs on your dashboard begin to tilt in the wrong direction. Latency numbers creep up. Error rates, once flatlined, develop a faint but persistent pulse. You get that sinking feeling in your stomach. Something is wrong, and you need to figure out what, fast. The Java Virtual Machine (JVM) is a complex engine, and when it sputters, the reasons aren’t always obvious. Over the years, I’ve learned that effective troubleshooting isn’t about magic; it’s about having a systematic way to ask the JVM the right questions and understand its answers.
Let’s walk through some of the most reliable ways I do that. Think of this as a guide to listening to what your application is trying to tell you.
The phone rings at 2 AM. The dashboard is red. The application is still responding, but every request is taking ten times longer than it should. Where do you even start? My first move is often to look at what every single thread is doing at that exact moment. A thread dump is like pressing pause on the entire JVM and getting a list of every task in progress.
You can get one using a simple command. If your application’s process ID is 12345, you run jstack 12345 > emergency_dump.txt. This command connects to the JVM and prints out the state and stack trace of every thread. Sometimes, you might not have jstack handy. In that case, sending a signal to the process works too: kill -3 12345. This tells the JVM to print the thread dump to its standard output, which is usually captured in a log file.
Now you have this text file. What are you looking for? You scan for threads that are BLOCKED or that have a state like WAITING (on object monitor). These are threads that want to do work but can’t proceed. The real find is a deadlock. You’ll see lines in the dump that say “Found one Java-level deadlock.” Below that, it will spell out a tragic story. Thread A is holding Lock X and waiting for Lock Y. Thread B is holding Lock Y and waiting for Lock X. Neither can move forward. It’s a traffic jam at the software level. Finding this pattern gives you an immediate, precise target. You know which part of your code is causing the gridlock.
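If you have never seen that report, it is worth provoking one on purpose. Below is a deliberately broken, hypothetical sketch of the lock-ordering mistake that produces it (the lock and thread names are invented); run it, take a jstack, and you will find the “Found one Java-level deadlock” section describing these two threads.

public class DeadlockDemo {
    private static final Object lockX = new Object();
    private static final Object lockY = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (lockX) {          // Thread A takes X first...
                sleepQuietly(100);
                synchronized (lockY) {      // ...then waits forever for Y
                    System.out.println("A got both locks");
                }
            }
        }, "thread-a").start();

        new Thread(() -> {
            synchronized (lockY) {          // Thread B takes Y first...
                sleepQuietly(100);
                synchronized (lockX) {      // ...then waits forever for X
                    System.out.println("B got both locks");
                }
            }
        }, "thread-b").start();
    }

    private static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

The fix is almost always the same: make every code path acquire the locks in one agreed-upon order.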
Memory problems are a different kind of beast. They often start quietly. You notice the garbage collector runs a little more often. A week later, it’s running constantly. Finally, the application grinds to a halt, spending all its time in “Full GC” and recovering less and less memory each time. This is a classic memory leak. To see it, you need a photograph of the heap—a moment frozen in time showing you every object living in memory.
That photograph is a heap dump. You can generate one with the jmap tool. A command like jmap -dump:live,format=b,file=heap.hprof 12345 creates a binary file called heap.hprof. The live option triggers a full garbage collection first, so you only see objects that are strongly held in memory—the real suspects. Be careful with this in production, as that full GC will pause your application.
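If you would rather trigger the dump from inside the application—say, from an internal admin endpoint—the same capability is exposed through the JDK’s HotSpotDiagnosticMXBean. A minimal sketch, assuming a HotSpot-based JVM:

import java.io.IOException;
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    // Writes a heap dump to the given path. Passing live=true mirrors jmap's live
    // option: it forces a collection first, so expect the same pause.
    public static void dump(String filePath, boolean live) throws IOException {
        HotSpotDiagnosticMXBean diagnostic =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap(filePath, live);
    }
}

It will refuse to overwrite an existing file, so pick a fresh name such as heap.hprof each time.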
The heap dump file is large and binary. You can’t read it directly. You need an analyzer. I often use Eclipse Memory Analyzer (MAT). You open the file in MAT, and it starts by offering a “Leak Suspects Report.” This is an excellent first step. It might say, “One instance of com.example.OrderCache loaded by WebAppClassLoader occupies 1,200,000,000 bytes (85%) of memory.” That’s a huge clue.
You can then drill down. MAT lets you see what this giant cache object is holding onto. You can follow the reference chain backward to see who holds the cache, and who holds that, all the way up to the “GC roots”—special objects that are always considered alive. Very often, you’ll find the cache is stored in a static field, or registered in a global listener list that was never cleaned up. Seeing this object graph makes the abstract concept of a “leak” painfully concrete.
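The shape of such a leak is usually depressingly simple once you see it. Here is a hypothetical sketch of the kind of class MAT tends to point at—an unbounded cache pinned by a static field (the name echoes the made-up example above):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OrderCache {
    // A static field stays reachable from a GC root for the lifetime of the ClassLoader,
    // so nothing put into this map is ever collected unless something removes it.
    private static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    public static void remember(String orderId, byte[] serializedOrder) {
        CACHE.put(orderId, serializedOrder); // grows forever: no eviction, no expiry
    }

    public static byte[] lookup(String orderId) {
        return CACHE.get(orderId);
    }
}

The fix is rarely exotic: bound the cache, give entries an expiry, or hand the job to a purpose-built caching library.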
Watching problems happen in real-time is much better than analyzing a crash scene after the fact. For that, you need a live dashboard into the JVM. Tools like JConsole and VisualVM are perfect for this. They use JMX, a built-in management system in the JVM.
To use them, you need to start your application with JMX enabled. It sounds more complicated than it is. You just add a few parameters:
java -Dcom.sun.management.jmxremote.port=9010 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-jar myapp.jar
This opens port 9010 for JMX connections without authentication or SSL (fine for a quick diagnostic on a secured internal network, but be careful). You then launch JConsole and connect to localhost:9010.
Suddenly, you have graphs. You see the heap usage rise and fall in a sawtooth pattern as the garbage collector does its work. You see how many threads are running. You see CPU usage. The power here is in the trends. Is the baseline of the heap slowly climbing after each GC, never returning to its original low? That’s a leak. Do you see a sudden vertical spike in thread count? Something just launched a lot of tasks. You can watch these graphs while you perform an action in the application and see the direct impact. This live correlation is invaluable.
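The numbers JConsole plots come from the platform MXBeans, and your own code can read them directly—handy when you want to log or alert on a trend without attaching a GUI. A small sketch using only the standard java.lang.management API:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class VitalSigns {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        MemoryUsage heap = memory.getHeapMemoryUsage();
        System.out.printf("Heap used: %d MB of %d MB committed%n",
                heap.getUsed() / (1024 * 1024),
                heap.getCommitted() / (1024 * 1024));
        System.out.printf("Live threads: %d (peak %d)%n",
                threads.getThreadCount(), threads.getPeakThreadCount());
    }
}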
Garbage collection logs are the definitive record of the JVM’s memory management. They tell a story of survival and reclamation. Modern JVMs use a unified logging system that is very powerful. You enable it with a command-line flag.
A common way to start is: java -Xlog:gc*,gc+heap=debug:file=gc.log -jar myapp.jar. This logs all GC events and heap details to a file. The logs look dense at first, but you learn to read them. You’ll see entries for young-generation pauses (usually fast and frequent) and full collections (usually slow and worrisome).
I look for a few key things. How long do the pauses last? A 50-millisecond pause every few minutes might be acceptable. A 2-second pause every 10 seconds is a crisis. How much memory is being recovered? If a Full GC runs but only recovers 10 megabytes from a 4-gigabyte heap, the heap is full of live objects—another sign of a leak. There are tools like GCViewer or online parsers that can ingest these logs and produce visual graphs, making these patterns even clearer. Tuning garbage collection is a deep topic, but the logs are the first step to knowing if you need to tune at all.
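Before opening the raw log file, I sometimes grab a quick numeric summary from the collectors themselves, since each one publishes cumulative counters over JMX. A minimal sketch:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounters {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() is the cumulative time (in milliseconds) spent in this collector
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}

Sampling these two numbers a minute apart tells you roughly how much of that minute went to garbage collection—a crude but honest first signal.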
Sometimes, the application is just slow. CPU is pegged at 100%, but there’s no deadlock. Where is all that processing power going? This is where a CPU profiler comes in. My tool of choice for production is async-profiler. It’s designed to have very low overhead, which means you can usually run it on a live system without making the problem worse.
You download it, and run a command like ./profiler.sh -d 60 -f flamegraph.html 12345. This profiles the process for 60 seconds and outputs an HTML file with a flame graph.
A flame graph is a beautiful visual tool. It’s a horizontal chart where each rectangle is a method. The width of the rectangle shows how often that method was on the CPU stack. The vertical stacking shows the call hierarchy. You look for the widest plateaus. A big, wide block labeled HashMap.getNode or JSONParser.parse immediately tells you where the CPU is spending its cycles. I once found a “performance optimization” that was doing expensive string computation inside a comparator used for sorting a large list. The flame graph showed a single method eating 40% of the CPU. We fixed the comparator, and the CPU graph fell off a cliff.
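To make that comparator story concrete, here is a hypothetical reconstruction of the pattern rather than the actual code: the slow version recomputes an expensive key on every comparison, while the fix computes each key exactly once.

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SortFix {
    // Stand-in for the heavy string work the flame graph exposed.
    static String expensiveKey(String s) {
        return s.toLowerCase().replaceAll("\\s+", " ").trim();
    }

    // Before: the key is recomputed on every comparison, on the order of n*log(n) times.
    static void sortSlow(List<String> items) {
        items.sort((a, b) -> expensiveKey(a).compareTo(expensiveKey(b)));
    }

    // After: compute each key once, then sort by the cached value.
    static void sortFast(List<String> items) {
        Map<String, String> keys = items.stream()
                .collect(Collectors.toMap(s -> s, SortFix::expensiveKey, (a, b) -> a));
        items.sort(Comparator.comparing(keys::get));
    }
}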
You’ve deployed a new version of your application to your server. Then you deploy again a few hours later. After a few days of this, the server starts running out of memory, but your heap dumps show the Java heap is fine. This can be a ClassLoader leak, specifically in the Metaspace (where class metadata lives).
Frameworks that use reflection, proxies, or bytecode generation (like many ORMs or DI containers) can create classes dynamically. If something holds a reference to a class, its ClassLoader cannot be garbage collected. If that ClassLoader loaded your web application, all the classes from that application stay in memory. Redeploy, and you load a whole new, identical set of classes. Do this enough, and Metaspace fills up.
You can watch this by adding Metaspace logging: -Xlog:gc+metaspace*=debug. You’ll see the Metaspace usage grow with each deployment and never shrink. The fix is often tracking down what’s holding the reference. Common culprits are threads spawned by the application (like timer threads) that were not set to be daemon threads, or objects stored in ThreadLocal variables that are never cleaned up. Tools like MAT can also help you find these lingering ClassLoader references in a heap dump.
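Here is a hypothetical sketch of the culprit I run into most often: a background task started by application code and never stopped at undeploy time, quietly pinning the old ClassLoader.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CacheRefresher {
    // The pool's threads are non-daemon by default, and both the threads and the
    // scheduled task reference classes loaded by the webapp's ClassLoader. If nobody
    // calls stop() when the application is undeployed, that ClassLoader can never be collected.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::refresh, 0, 1, TimeUnit.MINUTES);
    }

    // The fix: wire this into the application's shutdown, for example from a
    // ServletContextListener's contextDestroyed method.
    public void stop() {
        scheduler.shutdownNow();
    }

    private void refresh() {
        // ... reload cached data ...
    }
}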
Your application is slow, but the CPU and heap look fine. It might be waiting on something outside itself. It could be a slow database query, a call to a remote API, or a clogged filesystem. You can investigate this by looking at what the application is doing from the operating system’s point of view.
The lsof command lists all files a process has open: lsof -p 12345. A surprisingly high number here might indicate files or sockets aren’t being closed. The netstat command shows network connections: netstat -tpn | grep 12345. Look for a large number of connections stuck in CLOSE_WAIT state. This means the remote side closed the connection, but your application’s socket hasn’t been closed yet. This is a classic sign of a resource leak—the code didn’t call close() on the socket or connection.
If the system is struggling to open new files or connections, you might hit the user limit. You can check that with ulimit -n. If your application needs 10,000 open files but the limit is 1024, it will fail mysteriously once it hits that ceiling.
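The CLOSE_WAIT pile-up almost always traces back to code that opens a connection and misses close() on some path, usually the error path. A small sketch of the usual fix, using try-with-resources so the socket is closed however the method exits (the class name is a placeholder):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class RemoteCall {
    // try-with-resources guarantees close() runs on success, exception, or early return,
    // so the socket never lingers in CLOSE_WAIT after the remote side hangs up.
    public static byte[] fetch(String host, int port, byte[] request) throws IOException {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream();
             InputStream in = socket.getInputStream()) {
            out.write(request);
            out.flush();
            return in.readAllBytes();
        }
    }
}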
The JVM uses memory outside the managed heap. This is called native memory, and it holds things like thread stacks, direct byte buffers (used by NIO), and the code cache. If native memory grows uncontrollably, the process eventually dies anyway: either the JVM fails a native allocation and reports an OutOfMemoryError, or the operating system’s out-of-memory killer terminates it outright. Both are confusing when your Java heap still has plenty of space left.
The JVM has a feature called Native Memory Tracking (NMT) to help. You enable it at startup: java -XX:NativeMemoryTracking=summary -jar myapp.jar. Later, when you want a report, you use the jcmd tool: jcmd 12345 VM.native_memory summary.
The report breaks down native memory by category: Java Heap, Class, Thread, Code, GC, and so on. I once saw an application where the “Internal” category was growing endlessly. It turned out to be a bug in a library that was allocating direct byte buffers in a loop and never releasing them. NMT pointed us straight to the category, and from there we found the specific code.
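For illustration, here is a hypothetical sketch of that pattern—allocating a fresh direct buffer on every call instead of reusing one. Direct buffers live outside the Java heap, so they never show up in your heap graphs, only in native memory accounting:

import java.nio.ByteBuffer;

public class BufferUsage {
    // Leaky pattern: every call claims another chunk of native memory. That memory is
    // only released when the ByteBuffer object itself is eventually garbage collected,
    // so native usage climbs while the Java heap looks perfectly healthy.
    static ByteBuffer readChunkLeaky() {
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
        // ... fill buffer from a channel ...
        return buffer;
    }

    // Safer pattern: allocate once and reuse (assuming a single-threaded reader).
    // clear() resets position and limit so the same native allocation can be refilled.
    private static final ByteBuffer SHARED = ByteBuffer.allocateDirect(1024 * 1024);

    static ByteBuffer readChunkReusing() {
        SHARED.clear();
        // ... fill SHARED from a channel ...
        return SHARED;
    }
}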
When all else fails, you need to see what the application is asking the operating system kernel to do, millisecond by millisecond. This is where strace (on Linux) comes in. It traces system calls and signals.
You can attach it to a running process: strace -f -tt -p 12345 -o strace.out. The -f follows any child threads/processes, -tt adds precise timestamps. The output goes to a file because it will be voluminous.
You then look through the log. Are there thousands of read() calls to a file? Maybe the app is repeatedly reading a small config file instead of caching it. Do you see long pauses on a poll() call? The application might be waiting for a network response. I’ve used this to diagnose a problem where an application was hanging for exactly 30 seconds. The strace output showed it was making a DNS lookup that was timing out. The JVM itself was stuck, not our code, waiting for a network configuration issue to resolve.
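The fix for the “rereads the same small file on every request” pattern is usually a one-time load. A minimal, hypothetical sketch (the file path is invented):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class AppConfig {
    // Loaded once, on first use of the class; strace should then show the file being
    // opened and read a single time instead of thousands.
    private static final Properties PROPS = load();

    private static Properties load() {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(Path.of("/etc/myapp/app.properties"))) {
            props.load(in);
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read configuration", e);
        }
        return props;
    }

    public static String get(String key) {
        return PROPS.getProperty(key);
    }
}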
The most powerful technique of all is correlation. A problem happens. You have application logs that say “Request ABC123 failed.” You have a thread dump from that moment. You have a GC log entry from that same second. But are they related? If you can link them together, you have the full story.
The key is to propagate a common identifier through your system. At the very beginning of a request, generate a unique ID and put it in a logging context that follows the request, such as SLF4J’s Mapped Diagnostic Context (MDC). In a web app, a servlet filter is a natural place to do this.
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC; // Using SLF4J's Mapped Diagnostic Context

public class RequestHandler {
    private static final Logger logger = LoggerFactory.getLogger(RequestHandler.class);

    public void handleRequest(Request req, Response res) {
        String requestId = generateId();
        MDC.put("requestId", requestId); // Attaches the ID to all subsequent log statements on this thread
        logger.info("Starting processing for {}", req.getPath());
        try {
            // ... process the request ...
        } finally {
            MDC.clear(); // Clean up at the end so the ID does not leak to the next request on this pooled thread
        }
    }

    private String generateId() {
        return UUID.randomUUID().toString(); // any unique token works; a random UUID is the simplest choice
    }
}
Now every log message for that request includes the ID. Some advanced profiling and monitoring tools can even tag their data (like a single slow method trace from async-profiler) with this same identifier. When you see an error for request ABC123, you can search all your logs, metrics, and thread dumps for that ID and reconstruct exactly what the JVM and your code were doing during that specific request’s lifetime. It turns disparate data points into a coherent narrative.
These techniques are not just for emergencies. Using them periodically on healthy systems builds your intuition. You learn what “normal” looks like for your application—the rhythm of its GC, the baseline thread count, the shape of its flame graph. Then, when something changes, you can spot the aberration immediately. It transforms you from someone who fights fires into someone who maintains a steady, predictable flame.