Your computer’s processor is faster than ever, but its memory hasn’t kept up. This has created a hidden bottleneck that many programmers miss. The fastest part of your program isn’t the clever algorithm; it’s the part that doesn’t have to wait for data.
Think of it like a workspace. The CPU’s registers are the tools in your hands—instantly available. The L1 and L2 caches are the tools on your desk—a slight reach. The main RAM is a warehouse across town. Every trip to that warehouse costs hundreds of desk-reach cycles. Modern performance is about keeping what you need on the desk.
Rust gives you an unusual level of control over where and how data is placed in that warehouse. You can arrange your data so that when your CPU goes to get something, it brings back exactly what you need next, right onto the desk. This is writing cache-friendly code. It’s not about doing less work; it’s about waiting less for the work to arrive.
Let’s look at how to do this. The first step happens before you write a single algorithm. It’s in the very shape of your data. Every field in a struct has an alignment requirement, and the compiler inserts empty gaps, or “padding,” between fields so that each one starts at a properly aligned address.
These gaps waste space, but more importantly, they can push the fields you care about onto different “desk drawers” (cache lines). A cache line is the unit of data transferred from memory to cache, typically 64 bytes. You want the data you use together to live in the same drawer.
You can guide the compiler. By default, Rust is free to reorder your fields and usually does a good job of minimizing padding on its own. The #[repr(C)] attribute switches that reordering off: it guarantees a predictable, C-compatible layout with fields in exactly the order you wrote them. With repr(C), the ordering job is yours, and the rule of thumb is simple: declare fields from the largest alignment requirement to the smallest to minimize padding.
// This struct looks innocent, but it's wasting your cache.
#[repr(C)] // Force the declared order; Rust's default layout would quietly fix this.
struct BadLayout {
    tiny_flag: bool,      // 1 byte. Needs 1-byte alignment.
    // 7 bytes of padding go here so the next field
    // can start at an 8-byte boundary.
    important_value: f64, // 8 bytes. Needs 8-byte alignment.
    another_number: u32,  // 4 bytes.
}

// This is the same data, arranged thoughtfully.
#[repr(C)] // "Keep the fields in the order I wrote them."
struct GoodLayout {
    important_value: f64, // 8 bytes first.
    another_number: u32,  // 4 bytes next.
    tiny_flag: bool,      // 1 byte last.
    // The compiler adds 3 bytes at the end so the whole struct
    // aligns to 8 bytes, but that's less harmful than padding in the middle.
}

fn main() {
    use std::mem::size_of;
    println!("BadLayout takes up {} bytes.", size_of::<BadLayout>());   // 24 on typical 64-bit targets.
    println!("GoodLayout takes up {} bytes.", size_of::<GoodLayout>()); // 16 on typical 64-bit targets.
}
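If you want to see exactly where each field lands, std::mem::offset_of! (stable since Rust 1.77) reports field offsets directly. A quick check against the repr(C) structs above, with expected values for a typical 64-bit target as comments (print_offsets is just a throwaway helper name):
use std::mem::offset_of;

fn print_offsets() {
    println!("{}", offset_of!(BadLayout, tiny_flag));       // 0
    println!("{}", offset_of!(BadLayout, important_value)); // 8: bytes 1..8 are pure padding.
    println!("{}", offset_of!(BadLayout, another_number));  // 16
}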
When you have a collection of items, how you store them defines your performance. For sequential access, the humble array or Vec is king. Data in an array is stored contiguously, one item right after the other in memory. When the CPU fetches the first element, it grabs the entire 64-byte cache line it sits in, which likely contains the next several elements for free.
Now consider a linked list. Each node is allocated separately on the heap. The next pointer might lead to a node in a completely different part of memory. Following that pointer is a guaranteed trip to a new, random desk drawer, often a slow journey to the warehouse. The CPU’s prefetcher, which tries to guess and load data you’ll need next, is utterly defeated.
// Fast for iteration: a simple array of points.
struct Path {
    points: [(f32, f32); 1000], // All 1000 points are in one neat block.
}

// Slow for iteration: a linked list.
struct PathNode {
    x: f32,
    y: f32,
    next: Option<Box<PathNode>>, // The next point is... somewhere else.
}

fn total_distance_array(path: &Path) -> f32 {
    let mut sum = 0.0;
    // This loop flows smoothly through memory.
    // The CPU prefetches the next cache line while you work on the current one.
    for window in path.points.windows(2) {
        let (x1, y1) = window[0];
        let (x2, y2) = window[1];
        sum += ((x2 - x1).powi(2) + (y2 - y1).powi(2)).sqrt();
    }
    sum
}
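For contrast, here is a sketch of the same computation over the linked version (total_distance_linked is our name for it, nothing standard). Every iteration dereferences next, and each dereference is a potential trip to a cold cache line:
fn total_distance_linked(mut node: &PathNode) -> f32 {
    let mut sum = 0.0;
    // Each `next` hop chases a pointer to a separate heap allocation.
    // The prefetcher has no pattern to latch onto.
    while let Some(next) = node.next.as_deref() {
        sum += ((next.x - node.x).powi(2) + (next.y - node.y).powi(2)).sqrt();
        node = next;
    }
    sum
}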
Sometimes, you don’t need all the fields of a struct at once. Imagine a physics simulation with thousands of particles. If your inner loop only updates velocities, but your struct stores position, velocity, and mass together (an Array of Structs, or AoS), you’re wasting cache space. For every velocity you load, you also load a position and mass you aren’t using yet, pushing other velocities out of cache.
The solution is to invert the structure. Use a Struct of Arrays (SoA). Store all velocities together in one array, all positions together in another. Now, when your velocity-update loop runs, every single byte it fetches is a velocity. The cache is packed with pure, useful data.
// Array of Structs (AoS). Convenient, but can be inefficient for field-wise ops.
#[derive(Default)]
struct ParticleAoS {
    pos_x: f32,
    pos_y: f32,
    vel_x: f32,
    vel_y: f32,
    mass: f32,
}

// Struct of Arrays (SoA). Optimal for bulk operations on specific fields.
struct ParticleSystemSoA {
    // Each of these vectors is dense with only one kind of data.
    pos_x: Vec<f32>,
    pos_y: Vec<f32>,
    vel_x: Vec<f32>,
    vel_y: Vec<f32>,
    mass: Vec<f32>,
}

impl ParticleSystemSoA {
    fn update_velocities(&mut self, dt: f32) {
        // This loop screams through memory. The `vel_x` data is contiguous.
        // It's easy for the compiler to transform this into SIMD instructions.
        for vx in &mut self.vel_x {
            *vx += dt * 9.81;
        }
        // Same for `vel_y`.
        for vy in &mut self.vel_y {
            *vy += dt * 9.81;
        }
    }
}
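For comparison, a sketch of the same update against the AoS layout (update_velocities_aos is a hypothetical counterpart, not part of any API). Every particle it touches drags position and mass into cache alongside the two velocities it actually needs:
fn update_velocities_aos(particles: &mut [ParticleAoS], dt: f32) {
    for p in particles.iter_mut() {
        // Only 8 of the 20 bytes loaded per particle are useful here.
        p.vel_x += dt * 9.81;
        p.vel_y += dt * 9.81;
    }
}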
The order in which you touch memory is as important as how it’s laid out. CPUs are brilliant at predicting simple, linear paths. If you access memory addresses 100, 104, 108, 112…, the prefetcher will start pulling data long before you ask for it. This is spatial locality at its best: sequential access.
Random access destroys that predictability. Jumping through a collection at indexes supplied by another list, or traversing a hash map, creates a chaotic pattern in which each access is a potential cache miss. If you must access data out of order, try to sort your access pattern first. Even a rough sort can turn chaotic jumps into short, forward strides.
fn process_data_sequentially(data: &[f64]) -> f64 {
    // The gold standard. The CPU sees this coming a mile away.
    data.iter().sum()
}

fn process_with_random_indices(indices: &[usize], data: &[f64]) -> f64 {
    // This is a cache nightmare. Each `data[i]` could be anywhere.
    indices.iter().map(|&i| data[i]).sum()
}

fn process_with_sorted_indices(mut indices: Vec<usize>, data: &[f64]) -> f64 {
    // A simple sort groups accesses to similar memory locations.
    // It transforms random jumps into several small, sequential passes.
    indices.sort_unstable(); // Cost paid once.
    indices.iter().map(|&i| data[i]).sum() // Benefit gained many times.
}
Objects created together in time often live together in memory. The default allocator tries to be efficient, and new allocations frequently end up in nearby addresses. You can exploit this. If you know a group of objects will be processed together, try to create them in a batch.
For this, arena allocators or object pools are powerful tools. Instead of asking the global allocator for each object individually, you allocate a large block of memory (the arena) and hand out slices of it. Every object from that arena is local to its neighbors, giving you excellent spatial locality when you iterate over them.
use bumpalo::Bump; // A simple arena allocator crate.

struct GraphNode<'a> {
    id: usize,
    neighbors: Vec<&'a GraphNode<'a>>, // References valid for the arena's lifetime.
}

fn build_tightly_packed_graph() {
    // Create an arena. All memory for our nodes comes from here.
    let arena = Bump::new();
    let mut nodes = Vec::new();
    for i in 0..5000 {
        // Allocate the node from the arena instead of making a
        // separate global-allocator call per node.
        let node = arena.alloc(GraphNode { id: i, neighbors: Vec::new() });
        nodes.push(node);
    }
    // Now `nodes` is a `Vec` of references, but they all point into one or
    // a few dense memory blocks inside the arena. Iterating has much
    // better locality than 5000 separate `Box` allocations.
    // Caveat: bumpalo does not run destructors, so the inner `Vec`s'
    // heap buffers leak here; see the variant below.
}
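A sketch of that variant, assuming bumpalo’s optional collections feature is enabled: its arena-backed Vec keeps the neighbor lists inside the same arena, so nothing is allocated outside it and everything is released in one shot.
use bumpalo::{Bump, collections::Vec as BumpVec};

struct ArenaNode<'a> {
    id: usize,
    neighbors: BumpVec<'a, &'a ArenaNode<'a>>, // Edge lists live in the arena too.
}

fn build_graph_fully_in_arena(arena: &Bump) {
    let mut nodes = Vec::new();
    for i in 0..5000 {
        let node = arena.alloc(ArenaNode { id: i, neighbors: BumpVec::new_in(arena) });
        nodes.push(node);
    }
    // Nodes and their edge lists now share the same dense blocks,
    // and the whole graph is freed at once when the arena is dropped.
}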
When multiple threads are involved, a new problem emerges: false sharing. Imagine two CPU cores, each with its own L1 cache. If Core 1 writes to a variable on cache line A, and Core 2 works with a different variable that happens to share cache line A, every write by Core 1 marks the whole line dirty. Core 2 must invalidate its copy and fetch the updated line, even though the variable it cares about hasn’t changed. The line ping-pongs between the cores, and performance craters.
The fix is to ensure that highly contended, write-heavy data from different threads lives on separate cache lines. Alignment gives you that guarantee: the #[repr(align(64))] attribute forces a type to start on a 64-byte boundary, the start of a cache line on most mainstream CPUs (some, like Apple’s M-series, use 128-byte lines).
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// This will suffer from false sharing.
struct NaiveCounter {
    count_a: AtomicU64,
    count_b: AtomicU64, // Sits right next to count_a. Same cache line.
}

// This isolates each counter.
#[repr(align(64))] // This struct must start at a 64-byte boundary.
struct CacheAligned(AtomicU64);

struct IsolatedCounters {
    // Each of these starts on its own cache line.
    counter_a: CacheAligned,
    counter_b: CacheAligned,
}

fn main() {
    let counters = IsolatedCounters {
        counter_a: CacheAligned(AtomicU64::new(0)),
        counter_b: CacheAligned(AtomicU64::new(0)),
    };
    // Scoped threads (Rust 1.63+) let us borrow `counters` without `Arc`.
    thread::scope(|s| {
        s.spawn(|| {
            for _ in 0..1_000_000 {
                counters.counter_a.0.fetch_add(1, Ordering::Relaxed);
            }
        });
        s.spawn(|| {
            for _ in 0..1_000_000 {
                counters.counter_b.0.fetch_add(1, Ordering::Relaxed);
            }
        });
    });
    // These threads do not slow each other down with cache invalidations.
}
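If you’d rather not hand-roll the wrapper, the crossbeam-utils crate provides CachePadded, which pads and aligns its contents to a cache line using a size chosen per target architecture. A minimal sketch of the same two counters (PaddedCounters is our own name):
use crossbeam_utils::CachePadded;
use std::sync::atomic::AtomicU64;

struct PaddedCounters {
    a: CachePadded<AtomicU64>, // Same idea as CacheAligned, maintained for you.
    b: CachePadded<AtomicU64>,
}

fn new_padded_counters() -> PaddedCounters {
    PaddedCounters {
        a: CachePadded::new(AtomicU64::new(0)),
        b: CachePadded::new(AtomicU64::new(0)),
    }
}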
Keep small, frequently used data on the stack or in a temporary array. The stack is almost always in the fastest levels of cache. If you have a small lookup table, a filter kernel, or a set of coefficients used millions of times in a loop, compute it once and store it locally.
Avoid recalculating the same thing or fetching it from a faraway struct field inside your hottest loop. Pull it into a local variable at the start of the loop. The compiler often does this for you (an optimization called “promotion to register”), but being explicit helps you reason about performance.
fn complex_convolution(signal: &[f32]) -> Vec<f32> {
    // This small kernel is used for every single output point.
    // Calculate it once, here. It will live in a register or L1 cache.
    let kernel: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];
    let mut result = Vec::with_capacity(signal.len().saturating_sub(4));
    // The inner loops access `kernel` repeatedly. It's already right there.
    for window in signal.windows(5) {
        let sum: f32 = window.iter()
            .zip(kernel.iter())
            .map(|(s, k)| s * k)
            .sum();
        result.push(sum);
    }
    result
}
Finally, trust your compiler. Write clear, sequential loops that operate on slices. The Rust compiler, via LLVM, is excellent at auto-vectorization—transforming your scalar operations into Single Instruction, Multiple Data (SIMD) instructions that can process 4, 8, or 16 pieces of data at once.
This only works when the data is contiguous and the access pattern is predictable. Iterator adapters like .zip(), .map(), and .sum() often create exactly those conditions. Methods like .chunks_exact() hand you fixed-size blocks with the ragged tail split off, which can be even more efficient because the compiler knows every block has the same length.
fn compute_l2_norm(a: &[f32], b: &[f32]) -> f32 {
    // This is the kind of clean, parallel operation compilers love.
    a.iter()
        .zip(b.iter())
        .map(|(&x, &y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}

// A more explicit pattern for manual control.
fn sum_blocks(data: &[i32], block_size: usize) -> i32 {
    let chunks = data.chunks_exact(block_size);      // Full blocks only...
    let tail: i32 = chunks.remainder().iter().sum(); // ...ragged tail handled separately.
    chunks.map(|block| block.iter().sum::<i32>()).sum::<i32>() + tail
}
Performance tuning can feel abstract, but cache efficiency is tangible. It’s the difference between a stuttering simulation and a smooth one, between a responsive server and a sluggish one. In Rust, you’re not just writing instructions; you’re organizing a workspace. You decide what goes on the desk, what stays in the drawer, and what gets a drawer all to itself. By thinking about data layout, access patterns, and locality from the start, you write code that doesn’t just run, but runs with the grain of the hardware. It’s where the true speed of Rust is often found.