A system that never breaks is a fantasy. In the real world, networks fail, databases slow down, and third-party APIs vanish. The goal isn’t to prevent every failure—that’s impossible. The goal is to build services that bend instead of shatter. In my work with Java microservices, I’ve found that resilience isn’t a library you add; it’s a set of deliberate practices woven into your code. Let’s talk about ten techniques that move your services from fragile to robust.
Think of a circuit breaker like a fuse in your home’s electrical system. When a circuit is overloaded, the fuse blows to protect the wiring. In software, a circuit breaker stops calling a failing service to prevent your entire application from crashing. You can implement this pattern cleanly in Java.
First, you define the rules. How many failures are too many? How long should you wait before trying again? Here’s a practical setup using Resilience4j, with Vavr’s Try to handle the result.
// Configure the behavior of your circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
// If 50% of calls fail, trip the breaker
.failureRateThreshold(50)
// Stay open for 30 seconds before testing the waters
.waitDurationInOpenState(Duration.ofSeconds(30))
// Analyze the last 10 calls to make decisions
.slidingWindowSize(10)
.build();
// Create the breaker instance for a specific service
CircuitBreaker circuitBreaker = CircuitBreaker.of("inventoryService", config);
// Wrap your vulnerable call
Supplier<String> protectedCall = CircuitBreaker
.decorateSupplier(circuitBreaker, () -> callInventoryService());
// Execute it, with a plan for when it fails
String result = Try.ofSupplier(protectedCall)
.recover(throwable -> "Using cached inventory data")
.get();
The breaker has three states. Closed means everything is fine, and calls go through. Open means the service is failing, and calls are rejected immediately, so your fallback runs instead of the real call. After the wait time, it moves to Half-Open, allowing a limited number of trial calls to see if the service is back. If those succeed, it closes again. This simple pattern stops a single sick service from infecting your whole ecosystem.
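You can also watch these transitions as they happen. As a small sketch, assuming the circuitBreaker instance from above and an slf4j logger named log, Resilience4j lets you register a listener that logs every state change:
circuitBreaker.getEventPublisher()
    .onStateTransition(event ->
        log.warn("inventoryService breaker: {}", event.getStateTransition()));
Seeing these transitions in your logs is often the first clue that a dependency is misbehaving.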
Not all failures are permanent. Sometimes a network packet gets lost, or a service is restarting. For these temporary glitches, you can simply try again. But don’t hammer a struggling service with immediate retries; that can make a bad situation worse. Instead, use a strategy that adds a delay between attempts.
This is called exponential backoff. You wait a little after the first failure, then a bit longer after the second, and so on. It gives the struggling system room to breathe. Let me show you how to set this up.
RetryConfig retryConfig = RetryConfig.custom()
    // Don't try more than 3 times total
    .maxAttempts(3)
    // Start with a 100ms delay and double it after each failure
    .intervalFunction(IntervalFunction.ofExponentialBackoff(100, 2))
    // Only retry on specific errors, like timeouts
    .retryOnException(e -> e instanceof TimeoutException)
    .build();
Retry retry = Retry.of("paymentGateway", retryConfig);
// Wrap your call in this retry logic
CheckedFunction0<Receipt> retryablePayment = Retry
.decorateCheckedSupplier(retry, () -> chargeCreditCard());
Receipt r = Try.of(retryablePayment).get();
A critical warning: only retry operations that are safe to repeat. A “GET” request to fetch data is usually safe. A “POST” request to charge a credit card is not. For non-safe operations, you need other mechanisms, like idempotency keys, which is a topic for another day.
Imagine a ship with multiple watertight compartments. If one compartment floods, the others stay dry and the ship stays afloat. A bulkhead in your code does the same thing. It isolates resources, so a problem in one area doesn’t drain all your capacity.
A common mistake is using a shared thread pool for all outgoing calls. If one external service starts responding very slowly, threads get stuck waiting. Eventually, all threads are waiting, and your service can’t handle any requests, even for perfectly healthy features. Bulkheads prevent this.
BulkheadConfig config = BulkheadConfig.custom()
// Only allow 5 concurrent calls to the reporting service
.maxConcurrentCalls(5)
// If all slots are busy, wait up to 100ms for one to free up
.maxWaitDuration(Duration.ofMillis(100))
.build();
Bulkhead reportBulkhead = Bulkhead.of("reportService", config);
// Decorate your call
Supplier<Report> reportSupplier = () -> generateComplexReport();
Supplier<Report> protectedSupplier = Bulkhead
.decorateSupplier(reportBulkhead, reportSupplier);
// Execute. If the bulkhead is full, it will throw an exception.
Try.ofSupplier(protectedSupplier)
.onSuccess(report -> sendToUser(report))
.onFailure(e -> log.info("System busy, could not generate report"));
Now, even if generateComplexReport() becomes incredibly slow, it can only tie up 5 threads. Your user authentication, product catalog, and other functions continue to use the rest of the application’s resources. You’ve contained the failure.
Sometimes, you need to protect others from your own service, or protect yourself from external limits. A rate limiter controls how many requests can be made in a given time period. This is crucial when calling APIs that have strict quotas or when you want to smooth out sudden bursts of traffic.
Let’s say you integrate with an email service that allows 100 sends per minute. Going over this limit might get you blocked. A rate limiter enforces this rule at the client side.
RateLimiterConfig config = RateLimiterConfig.custom()
// Allow 100 calls...
.limitForPeriod(100)
// ...per 60-second window
.limitRefreshPeriod(Duration.ofSeconds(60))
// If the limit is hit, wait up to 500ms for permission
.timeoutDuration(Duration.ofMillis(500))
.build();
RateLimiter emailLimiter = RateLimiter.of("sendGrid", config);
// Wrap the email send operation
CheckedRunnable limitedSend = RateLimiter
    .decorateCheckedRunnable(emailLimiter, () -> dispatchEmail(email));
Try.run(limitedSend)
.onSuccess(v -> log.debug("Email sent"))
.onFailure(e -> {
// This happens if the wait time is exceeded
log.warn("Rate limit reached, email queued for later");
queueForRetry(email);
});
This pattern gives you predictable behavior. You operate within your allowed budget and avoid surprise errors or throttling from your dependencies.
One of the simplest yet most powerful rules: no call should wait forever. Always set a timeout. A timeout is a promise you make to the rest of your system: “I will give up after this long and free up my resources.” Without timeouts, a single slow downstream service can cause requests to pile up until your service collapses.
You can combine a timeout with a Future for clean handling.
// Define a 2-second timeout policy
TimeLimiterConfig config = TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(2))
// This is important: cancel the underlying future
.cancelRunningFuture(true)
.build();
TimeLimiter limiter = TimeLimiter.of(config);
// The original slow call wrapped in a Future
Supplier<CompletableFuture<String>> futureSupplier = () -> CompletableFuture
.supplyAsync(this::callVerySlowLegacySystem);
// Apply the time limit to that future
Callable<String> timeLimitedCall = TimeLimiter
.decorateFutureSupplier(limiter, futureSupplier);
// Execute
String result = Try.ofCallable(timeLimitedCall)
.recover(throwable -> "Default value after timeout")
.get();
The key is cancelRunningFuture(true). When the timeout fires, the TimeLimiter cancels the underlying future rather than letting it linger. One caveat: cancelling a plain CompletableFuture marks it as failed but does not interrupt the thread still executing the slow call, so the work may continue in the background, consuming resources even though the caller has already moved on.
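If you need the slow work to actually stop, one option is to run it on an ExecutorService and let the TimeLimiter wrap the Future returned by submit(), because those futures do interrupt their worker thread when cancelled. This is a hedged sketch reusing the limiter from above; the legacyPool executor is an assumption, and it only helps if callVerySlowLegacySystem responds to interruption.
ExecutorService legacyPool = Executors.newFixedThreadPool(4);
// Futures from ExecutorService.submit() honor cancel(true) by interrupting their worker thread
Callable<String> interruptibleCall = TimeLimiter.decorateFutureSupplier(
    limiter,
    () -> legacyPool.submit(this::callVerySlowLegacySystem));
String value = Try.ofCallable(interruptibleCall)
    .recover(throwable -> "Default value after timeout")
    .get();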
When something fails, what’s your backup plan? A fallback is that plan. It’s not about hiding the error; it’s about providing a useful, if degraded, experience. A good fallback is fast and reliable. It should never itself call another flaky service.
Consider a product details page. If the main database is slow, maybe you can show slightly stale data from a local cache.
public ProductDetails getProductDetails(String productId) {
return Try.ofSupplier(() -> productDb.fetchLatest(productId))
.recover(throwable -> {
// Log the failure for the ops team
log.error("Primary DB failed for {}", productId, throwable);
// Provide a graceful fallback for the user
return productCache.get(productId)
.orElse(ProductDetails.UNAVAILABLE);
})
.get();
}
The fallback here is a simple cache lookup. It might not have the newest price, but it lets the user see the product description and reviews. This is often better than a spinning loader or a generic error page.
When a request fails in a chain of ten microservices, finding the root cause is a nightmare without proper tracking. You need a way to follow a single request’s journey. This is done with a correlation ID, a unique string passed from service to service.
You can implement this with a servlet filter in a Spring Boot application.
@Component
public class CorrelationFilter extends OncePerRequestFilter {
@Override
protected void doFilterInternal(HttpServletRequest request,
                                HttpServletResponse response,
                                FilterChain chain) throws ServletException, IOException {
// Check if the incoming request already has an ID
String id = request.getHeader("X-Request-ID");
if (id == null || id.isBlank()) {
// If not, generate a new one
id = "req_" + UUID.randomUUID().toString();
}
// Store it in a thread-local context for logging
MDC.put("requestId", id);
// Pass it back in the response header
response.setHeader("X-Request-ID", id);
try {
chain.doFilter(request, response);
} finally {
// Clean up after the request is done
MDC.remove("requestId");
}
}
}
Now, configure your logging pattern (in logback-spring.xml for example) to include this ID.
<pattern>%d{ISO8601} [%thread] [%X{requestId}] %-5level %logger{36} - %msg%n</pattern>
Every log message from that request will have the same [req_1234-...] tag. When a user reports an error, you can search your logs for that ID and see exactly what happened at every step, across every service. This turns a multi-hour debugging session into a five-minute search.
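One gap to watch: the filter only handles incoming requests. To follow a request across services, every outgoing call must forward the same header. Here is a hedged sketch of a RestTemplate interceptor that does this; the class name is illustrative, and you still need to register it on your RestTemplate bean.
public class CorrelationPropagationInterceptor implements ClientHttpRequestInterceptor {
    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body,
                                        ClientHttpRequestExecution execution) throws IOException {
        // Copy the current request's ID from the MDC onto the outgoing call
        String id = MDC.get("requestId");
        if (id != null) {
            request.getHeaders().set("X-Request-ID", id);
        }
        return execution.execute(request, body);
    }
}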
In a Kubernetes or cloud environment, your service doesn’t just start and run. It gets probed. A readiness probe tells the platform, “I am ready to accept traffic.” This is different from being alive. A service can be running but not ready if its database connection is down.
Expose a simple endpoint that checks your vital connections.
@RestController
@Slf4j
public class ReadinessController {
@Autowired
private DataSource dataSource;
@Autowired
private RedisConnectionFactory redisFactory;
@GetMapping("/internal/ready")
public ResponseEntity<String> readinessCheck() {
List<String> errors = new ArrayList<>();
// Check database
try (Connection conn = dataSource.getConnection()) {
if (!conn.isValid(2)) {
errors.add("Database connection invalid");
}
} catch (SQLException e) {
errors.add("Database unavailable: " + e.getMessage());
}
// Check cache (release the connection when done)
try (RedisConnection conn = redisFactory.getConnection()) {
    conn.ping();
} catch (Exception e) {
    errors.add("Cache unavailable: " + e.getMessage());
}
if (errors.isEmpty()) {
return ResponseEntity.ok("READY");
} else {
log.warn("Readiness check failed: {}", errors);
return ResponseEntity.status(503) // Service Unavailable
.body("NOT READY: " + String.join(", ", errors));
}
}
}
The platform will stop sending user traffic to your service if this returns a 503 error. This prevents users from hitting a service that is doomed to fail because a critical dependency is missing. It’s a contract between your service and the infrastructure.
What happens when your service is the one under heavy load? You need a way to push back. This is called handling backpressure. You can’t process an infinite number of requests. A good strategy is to reject new requests gracefully when you’re at capacity, rather than letting them queue until everything times out.
A practical way is to use a bounded queue with a rejection policy in your thread pool.
// Create an executor with strict limits
ThreadPoolExecutor healthCheckExecutor = new ThreadPoolExecutor(
4, // Keep 4 core threads ready
4, // Never use more than 4 threads total
0L, TimeUnit.MILLISECONDS,
new LinkedBlockingQueue<>(20), // Buffer up to 20 tasks in a queue
new ThreadPoolExecutor.AbortPolicy() // Reject when queue is full
);
public CompletableFuture<HealthStatus> performHealthCheck(UserRequest req) {
    try {
        return CompletableFuture
            .supplyAsync(() -> runIntensiveCheck(req), healthCheckExecutor)
            // Some other error inside the check itself
            .exceptionally(HealthStatus::error);
    } catch (RejectedExecutionException e) {
        // supplyAsync throws immediately when the queue is full, so catch it here.
        // We are at capacity. Communicate this clearly.
        return CompletableFuture.completedFuture(HealthStatus.builder()
            .status("OVERLOADED")
            .message("System busy, please try shortly")
            .build());
    }
}
By returning a clear “OVERLOADED” status or an HTTP 429 (Too Many Requests) code, you tell the client to back off and try later. This is honest and allows the overall system to stabilize. It’s better than silently failing or becoming unresponsive.
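At the HTTP boundary, you can translate that status into a response code. This is a hedged sketch of a hypothetical controller method built on the performHealthCheck method and HealthStatus type above; the endpoint path and the getStatus() accessor are assumptions.
@PostMapping("/internal/health-report")
public CompletableFuture<ResponseEntity<HealthStatus>> healthReport(@RequestBody UserRequest req) {
    return performHealthCheck(req)
        .thenApply(status -> "OVERLOADED".equals(status.getStatus())
            // 429 plus a Retry-After hint tells well-behaved clients when to come back
            ? ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS).header("Retry-After", "5").body(status)
            : ResponseEntity.ok(status));
}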
All these patterns are useless if you don’t test them. You must verify that your circuit breaker opens under failure, your retries actually happen, and your fallbacks provide correct data. This requires simulating failure, which is a different kind of testing.
You can use libraries to mock external services and make them fail or be slow on command. Here’s a basic test for a circuit breaker scenario using Spring’s test utilities.
@SpringBootTest(webEnvironment = WebEnvironment.RANDOM_PORT)
class OrderServiceResilienceTest {
@Autowired
private TestRestTemplate restTemplate;
@MockBean
private PaymentClient paymentClient; // This client calls the external service
@Test
void circuitBreakerOpensAndProvidesFallback() throws Exception {
// 1. Simulate repeated failures from the payment service
when(paymentClient.charge(any()))
.thenThrow(new RuntimeException("Payment service down"));
// Make the initial calls that should cause failures
for (int i = 0; i < 10; i++) {
ResponseEntity<OrderResponse> response = restTemplate.postForEntity(
"/api/order",
new OrderRequest(),
OrderResponse.class
);
// The first few might return the error, or a fallback after a failure
}
// 2. After enough failures, the breaker should be OPEN.
// The next call should instantly get the fallback without calling the mock.
ResponseEntity<OrderResponse> finalResponse = restTemplate.postForEntity(
"/api/order",
new OrderRequest(),
OrderResponse.class
);
// Verify the fallback content
assertThat(finalResponse.getBody().getStatus()).isEqualTo("ORDER_RECEIVED_PAYMENT_PENDING");
// Verify the mocked client was NOT called again (breaker is open)
verify(paymentClient, times(10)).charge(any()); // Only called the first 10 times
}
}
This test proves your service’s behavior under duress. You should write similar tests for timeouts, bulkhead saturation, and retry logic. Resilience is a feature, and like any feature, it needs validation.
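For retries, you do not even need Spring; you can decorate a supplier that always fails and count the attempts. This is a hedged sketch reusing the retryConfig and the Vavr-based Retry API from earlier; the test name and counter are illustrative.
@Test
void retryStopsAfterThreeAttempts() {
    AtomicInteger attempts = new AtomicInteger();
    Retry retry = Retry.of("paymentGatewayTest", retryConfig);
    // Every attempt times out, so every attempt is retryable under our config
    CheckedFunction0<Receipt> alwaysFailing = Retry.decorateCheckedSupplier(retry, () -> {
        attempts.incrementAndGet();
        throw new TimeoutException("still slow");
    });
    Try<Receipt> result = Try.of(alwaysFailing);
    assertThat(result.isFailure()).isTrue();
    // maxAttempts(3) means three calls in total: the first try plus two retries
    assertThat(attempts.get()).isEqualTo(3);
}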
Building resilient services is an exercise in humility. You accept that things will go wrong. Your job is to design the way your system responds to that reality. By using circuit breakers, bulkheads, timeouts, and clear fallbacks, you build services that are still useful even when they’re not perfect. You give your system the strength to handle storms, and in doing so, you build trust with the users who depend on it. Start by picking one pattern—maybe timeouts—and implementing it consistently. Then add another. Resilience is a journey, taken one deliberate step at a time.