Mastering Rust's Atomics: Build Lightning-Fast Lock-Free Data Structures

Explore Rust's advanced atomics for lock-free programming. Learn to create high-performance concurrent data structures and optimize multi-threaded systems.

Rust’s advanced atomics open up a whole new world of possibilities for creating lock-free data structures. I’ve spent countless hours exploring this fascinating area, and I’m excited to share what I’ve learned.

Let’s start with the basics. Atomics in Rust are types whose operations are indivisible: each load, store, or read-modify-write happens as a single uninterruptible step, so they can be safely shared between threads without locks. They live in std::sync::atomic and include AtomicBool, AtomicI32, AtomicUsize, and more.
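As a minimal sketch of what that indivisibility buys us (assuming std::thread::scope, stable since Rust 1.63), here is a counter incremented from several threads at once: no increment is ever lost, even though there is no lock anywhere.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// A shared counter incremented from several threads without a lock.
// Each fetch_add is a single indivisible read-modify-write.
fn parallel_count(threads: usize, per_thread: usize) -> usize {
    let counter = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // All increments are observed: no torn updates, no lost writes.
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(parallel_count(4, 10_000), 40_000);
}
```

With a plain `usize` and no synchronization, the same program would be a data race and would not even compile under Rust's sharing rules.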

The real power of atomics comes from their ability to perform operations that are both atomic and ordered. This is where memory ordering comes into play. Rust provides several memory ordering options: Relaxed, Release, Acquire, AcqRel, and SeqCst. Each of these has different guarantees and performance implications.

I’ve found that Relaxed ordering is often sufficient for simple counters or flags, while Release and Acquire are crucial for implementing more complex data structures. SeqCst, the strictest ordering, is rarely needed but can be useful in certain scenarios.
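The Release/Acquire pairing is easiest to see in a flag handoff. In this sketch, the producer's Release store "publishes" everything it wrote before the flag, and the consumer's Acquire load guarantees it sees those writes once it sees the flag:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

// Producer writes `data`, then raises `ready` with Release.
// Once the consumer observes `ready` with Acquire, the earlier
// write to `data` is guaranteed to be visible too.
fn handoff() -> u32 {
    let data = AtomicU32::new(0);
    let ready = AtomicBool::new(false);
    let mut seen = 0;
    thread::scope(|s| {
        s.spawn(|| {
            data.store(42, Ordering::Relaxed);    // the payload...
            ready.store(true, Ordering::Release); // ...published here
        });
        // Main thread spins until the flag is raised.
        while !ready.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        seen = data.load(Ordering::Relaxed); // guaranteed to see 42
    });
    seen
}

fn main() {
    assert_eq!(handoff(), 42);
}
```

If both the store and the load of `ready` were Relaxed, the consumer could legally observe the flag without observing the payload.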

Now, let’s dive into creating a lock-free stack. This is a great starting point for understanding how to use atomics in practice:

use std::sync::atomic::{AtomicPtr, Ordering};
use std::ptr;

struct Node<T> {
    data: T,
    next: *mut Node<T>,
}

pub struct Stack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> Stack<T> {
    pub fn new() -> Self {
        Stack {
            head: AtomicPtr::new(ptr::null_mut()),
        }
    }

    pub fn push(&self, data: T) {
        let new_node = Box::into_raw(Box::new(Node {
            data,
            next: ptr::null_mut(),
        }));

        loop {
            let old_head = self.head.load(Ordering::Relaxed);
            unsafe {
                (*new_node).next = old_head;
            }
            match self.head.compare_exchange_weak(
                old_head,
                new_node,
                Ordering::Release,
                Ordering::Relaxed,
            ) {
                Ok(_) => break,
                Err(_) => continue,
            }
        }
    }

    pub fn pop(&self) -> Option<T> {
        loop {
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return None;
            }
            // Caution: dereferencing `head` is only safe if no other
            // thread can free it concurrently; a real implementation
            // pairs this with hazard pointers or epoch-based
            // reclamation (both discussed below).
            let next = unsafe { (*head).next };
            match self.head.compare_exchange_weak(
                head,
                next,
                Ordering::Acquire,
                Ordering::Relaxed,
            ) {
                Ok(_) => {
                    let data = unsafe { Box::from_raw(head).data };
                    return Some(data);
                }
                Err(_) => continue,
            }
        }
    }
}

This lock-free stack uses AtomicPtr to manage the head of the stack. The push and pop operations use compare_exchange_weak to atomically update the head pointer.

One challenge I encountered when working with lock-free structures is the ABA problem. This occurs when a value changes from A to B and back to A, potentially causing issues in algorithms that assume unchanged values. To combat this, we can use techniques like hazard pointers or epoch-based reclamation.
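The problem is easy to demonstrate even single-threaded: compare_exchange compares only the current value, not its history, so an A→B→A sequence is invisible to it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Simulate an A -> B -> A history, then show that the CAS still
// succeeds: compare_exchange checks the current value only.
fn aba_cas_succeeds() -> bool {
    let v = AtomicUsize::new(1);  // value "A"
    v.store(2, Ordering::SeqCst); // changed to "B"...
    v.store(1, Ordering::SeqCst); // ...and back to "A"
    // The CAS cannot tell that anything happened in between.
    v.compare_exchange(1, 3, Ordering::SeqCst, Ordering::SeqCst).is_ok()
}

fn main() {
    assert!(aba_cas_succeeds());
    // With raw pointers the "same" address may now belong to a freshly
    // allocated node, so a stack's CAS can splice in a stale `next`.
}
```

With plain integers this is usually harmless; with pointers, the "same" address may now name a different, recycled node, which is exactly what breaks naive lock-free stacks.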

Let’s look at a simple implementation of hazard pointers:

use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};
use std::ptr;

const MAX_HAZARD_POINTERS: usize = 100;

struct HazardPointer {
    pointer: AtomicPtr<u8>,
}

struct HazardPointerList {
    list: [HazardPointer; MAX_HAZARD_POINTERS],
    next_free: AtomicUsize,
}

impl HazardPointerList {
    fn new() -> Self {
        HazardPointerList {
            list: std::array::from_fn(|_| HazardPointer {
                pointer: AtomicPtr::new(ptr::null_mut()),
            }),
            next_free: AtomicUsize::new(0),
        }
    }

    fn acquire(&self, ptr: *mut u8) -> usize {
        loop {
            let index = self.next_free.fetch_add(1, Ordering::Relaxed) % MAX_HAZARD_POINTERS;
            // Claim the slot only if it is currently free; a plain swap
            // here could clobber another thread's live hazard pointer.
            if self.list[index]
                .pointer
                .compare_exchange(ptr::null_mut(), ptr, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return index;
            }
        }
    }

    fn release(&self, index: usize) {
        self.list[index].pointer.store(ptr::null_mut(), Ordering::Release);
    }

    fn is_hazardous(&self, ptr: *mut u8) -> bool {
        for hp in &self.list {
            if hp.pointer.load(Ordering::Acquire) == ptr {
                return true;
            }
        }
        false
    }
}

This hazard pointer implementation prevents use-after-free (and, with it, many ABA scenarios) by ensuring that no thread still holds a reference to a pointer before that pointer is freed.
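The listing above shows how hazards are published, but not the reclamation side. Here is a hedged, single-threaded sketch of that missing piece; the names `retire` and `scan` are my own illustrative choices, not std APIs. Freeing is deferred while any slot still protects the pointer:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

const SLOTS: usize = 8;

// Minimal sketch: a hazard slot table plus a per-thread retire list.
struct Hazards {
    slots: [AtomicPtr<u8>; SLOTS],
    retired: Vec<*mut u8>,
}

impl Hazards {
    fn new() -> Self {
        Hazards {
            slots: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())),
            retired: Vec::new(),
        }
    }

    fn protect(&self, slot: usize, p: *mut u8) {
        self.slots[slot].store(p, Ordering::Release);
    }

    fn clear(&self, slot: usize) {
        self.slots[slot].store(ptr::null_mut(), Ordering::Release);
    }

    // Defer freeing: a retired pointer is dropped only once no slot
    // still protects it.
    fn retire(&mut self, p: *mut u8) {
        self.retired.push(p);
        self.scan();
    }

    fn scan(&mut self) {
        let slots = &self.slots;
        self.retired.retain(|&p| {
            if slots.iter().any(|s| s.load(Ordering::Acquire) == p) {
                true // still hazardous: keep deferring
            } else {
                unsafe { drop(Box::from_raw(p)) }; // safe to reclaim now
                false
            }
        });
    }
}

fn lifecycle() -> (usize, usize) {
    let mut hz = Hazards::new();
    let p = Box::into_raw(Box::new(7u8));
    hz.protect(0, p);
    hz.retire(p); // still protected: stays on the retire list
    let while_protected = hz.retired.len();
    hz.clear(0);
    hz.scan(); // no longer protected: freed here
    (while_protected, hz.retired.len())
}

fn main() {
    assert_eq!(lifecycle(), (1, 0));
}
```

A production scheme batches scans and shares retire lists across threads; crates like crossbeam package this machinery so you rarely write it by hand.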

Another interesting aspect of lock-free programming is the concept of wait-free algorithms. These guarantee that every operation completes in a finite number of steps, regardless of the actions of other threads. Implementing wait-free algorithms can be challenging, but they provide the strongest progress guarantee.

Here’s a simple example of a wait-free counter:

use std::sync::atomic::{AtomicUsize, Ordering};

struct WaitFreeCounter {
    value: AtomicUsize,
}

impl WaitFreeCounter {
    fn new() -> Self {
        WaitFreeCounter {
            value: AtomicUsize::new(0),
        }
    }

    fn increment(&self) -> usize {
        self.value.fetch_add(1, Ordering::Relaxed)
    }

    fn get(&self) -> usize {
        self.value.load(Ordering::Relaxed)
    }
}

This counter is wait-free because both increment and get operations complete in a constant number of steps, regardless of contention.
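To make the distinction concrete, compare a CAS-loop increment, which is lock-free but not wait-free, with fetch_add. This contrast is my own illustration, not part of the counter above:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Lock-free increment: a CAS retry loop. The system as a whole makes
// progress, but one thread can lose the race arbitrarily often.
fn cas_increment(v: &AtomicUsize) -> usize {
    let mut cur = v.load(Ordering::Relaxed);
    loop {
        match v.compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => return prev,
            Err(actual) => cur = actual, // another thread won; retry
        }
    }
}

// Wait-free increment: one fetch_add, no retry loop in our code.
// (Some architectures still lower it to an LL/SC loop in hardware.)
fn wait_free_increment(v: &AtomicUsize) -> usize {
    v.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    let v = AtomicUsize::new(0);
    assert_eq!(cas_increment(&v), 0);       // returns the previous value
    assert_eq!(wait_free_increment(&v), 1); // likewise
    assert_eq!(v.load(Ordering::Relaxed), 2);
}
```

Both produce the same result; the difference is purely in the progress guarantee each thread gets under contention.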

When working with lock-free data structures, memory management becomes a critical concern. We need to ensure that memory is reclaimed safely without causing use-after-free or memory leak issues. One popular approach is epoch-based reclamation.

The basic idea behind epoch-based reclamation is to divide time into epochs. A thread entering a critical section reads the global epoch and announces it as the epoch it is operating in. When freeing memory, instead of deallocating immediately, we add the pointer to a list associated with the current epoch. Memory is reclaimed only once every thread has moved past the epoch in which it was retired.

Here’s a simplified implementation of epoch-based reclamation:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::cell::UnsafeCell;

const EPOCH_COUNT: usize = 3;

struct Epoch {
    current: AtomicUsize,
    participants: [AtomicUsize; EPOCH_COUNT],
    garbage: [UnsafeCell<Vec<*mut u8>>; EPOCH_COUNT],
}

impl Epoch {
    fn new() -> Self {
        Epoch {
            current: AtomicUsize::new(0),
            participants: Default::default(),
            garbage: Default::default(),
        }
    }

    fn enter(&self) -> usize {
        let epoch = self.current.load(Ordering::Relaxed);
        self.participants[epoch].fetch_add(1, Ordering::Relaxed);
        epoch
    }

    fn exit(&self, epoch: usize) {
        self.participants[epoch].fetch_sub(1, Ordering::Relaxed);
    }

    fn try_advance(&self) {
        let current = self.current.load(Ordering::Relaxed);
        let next = (current + 1) % EPOCH_COUNT;
        if self.participants[next].load(Ordering::Relaxed) == 0 {
            self.current.compare_exchange_weak(
                current,
                next,
                Ordering::SeqCst,
                Ordering::Relaxed,
            ).ok();
        }
    }

    fn add_garbage(&self, ptr: *mut u8) {
        let epoch = self.current.load(Ordering::Relaxed);
        unsafe {
            (*self.garbage[epoch].get()).push(ptr);
        }
    }

    fn collect_garbage(&self) {
        let safe_epoch = (self.current.load(Ordering::Relaxed) + 1) % EPOCH_COUNT;
        for ptr in unsafe { &mut *self.garbage[safe_epoch].get() } {
            unsafe {
                drop(Box::from_raw(*ptr));
            }
        }
        unsafe {
            (*self.garbage[safe_epoch].get()).clear();
        }
    }
}

This epoch-based reclamation system allows for safe memory management in lock-free data structures. It’s particularly useful for structures like our earlier lock-free stack, where nodes need to be safely deallocated.

One area where lock-free programming really shines is in high-performance systems. I’ve used these techniques to create lightning-fast concurrent hash maps, queues, and even more complex data structures like skip lists.

For example, here’s the skeleton of a lock-free hash map:

use std::sync::atomic::{AtomicPtr, Ordering};
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;

struct Node<K, V> {
    key: K,
    value: V,
    next: AtomicPtr<Node<K, V>>,
}

struct Bucket<K, V> {
    head: AtomicPtr<Node<K, V>>,
}

pub struct LockFreeHashMap<K, V> {
    buckets: Box<[Bucket<K, V>]>,
}

impl<K: Hash + Eq, V> LockFreeHashMap<K, V> {
    pub fn new(capacity: usize) -> Self {
        let mut buckets = Vec::with_capacity(capacity);
        for _ in 0..capacity {
            buckets.push(Bucket { head: AtomicPtr::new(std::ptr::null_mut()) });
        }
        LockFreeHashMap { buckets: buckets.into_boxed_slice() }
    }

    fn hash(&self, key: &K) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.buckets.len()
    }

    pub fn insert(&self, key: K, value: V) -> Option<V> {
        let bucket = &self.buckets[self.hash(&key)];
        let new_node = Box::into_raw(Box::new(Node {
            key,
            value,
            next: AtomicPtr::new(std::ptr::null_mut()),
        }));

        loop {
            let head = bucket.head.load(Ordering::Acquire);
            unsafe {
                (*new_node).next.store(head, Ordering::Relaxed);
            }
            match bucket.head.compare_exchange_weak(
                head,
                new_node,
                Ordering::Release,
                Ordering::Relaxed,
            ) {
                Ok(_) => return None,
                Err(_) => continue,
            }
        }
    }

    // Implement get, remove, etc.
}

This hash map uses separate chaining with a lock-free linked list per bucket. The insert operation is lock-free: concurrent inserts, even into the same bucket, simply retry their CAS until they succeed. Note that this skeleton always prepends, so inserting an existing key shadows the old entry rather than replacing it, and the returned Option<V> is always None.
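The skeleton omits get, so here is a hedged, self-contained sketch of a lock-free read (struct name shortened to Map; nodes are deliberately leaked on drop to keep the example short). The traversal uses Acquire loads so each node's fields are visible once its pointer is:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<K, V> { key: K, value: V, next: AtomicPtr<Node<K, V>> }
struct Bucket<K, V> { head: AtomicPtr<Node<K, V>> }
pub struct Map<K, V> { buckets: Box<[Bucket<K, V>]> }

impl<K: Hash + Eq, V> Map<K, V> {
    pub fn new(capacity: usize) -> Self {
        let buckets = (0..capacity)
            .map(|_| Bucket { head: AtomicPtr::new(std::ptr::null_mut()) })
            .collect::<Vec<_>>()
            .into_boxed_slice();
        Map { buckets }
    }

    fn index(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    pub fn insert(&self, key: K, value: V) {
        let bucket = &self.buckets[self.index(&key)];
        let node = Box::into_raw(Box::new(Node {
            key, value, next: AtomicPtr::new(std::ptr::null_mut()),
        }));
        loop {
            let head = bucket.head.load(Ordering::Acquire);
            unsafe { (*node).next.store(head, Ordering::Relaxed); }
            if bucket.head
                .compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    // Lock-free read: walk the chain. Safe here only because this
    // sketch never removes or frees nodes; with removal, the returned
    // reference would need hazard-pointer or epoch protection.
    pub fn get(&self, key: &K) -> Option<&V> {
        let mut cur = self.buckets[self.index(key)].head.load(Ordering::Acquire);
        while !cur.is_null() {
            let node = unsafe { &*cur };
            if node.key == *key {
                return Some(&node.value);
            }
            cur = node.next.load(Ordering::Acquire);
        }
        None
    }
}

fn demo() -> (i32, i32, bool) {
    let m = Map::new(16);
    m.insert("a", 1);
    m.insert("b", 2);
    (*m.get(&"a").unwrap(), *m.get(&"b").unwrap(), m.get(&"c").is_none())
}

fn main() {
    assert_eq!(demo(), (1, 2, true));
}
```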

When implementing these complex data structures, it’s crucial to consider the impact of cache coherence. False sharing, where threads inadvertently contend for the same cache line despite accessing different data, can significantly impact performance. To mitigate this, we can use techniques like padding or aligning our data structures to cache line boundaries.

Here’s an example of how we might pad our Bucket struct to avoid false sharing:

use std::sync::atomic::AtomicPtr;

const CACHE_LINE_SIZE: usize = 64;

#[repr(align(64))]
struct Bucket<K, V> {
    head: AtomicPtr<Node<K, V>>,
    // Generic parameters can't appear in array-length expressions on
    // stable Rust, but every AtomicPtr is pointer-sized, so we can pad
    // with a non-generic size. (repr(align(64)) alone already rounds
    // the struct up to 64 bytes; the field just makes the intent explicit.)
    _padding: [u8; CACHE_LINE_SIZE - std::mem::size_of::<AtomicPtr<u8>>()],
}

This ensures that each Bucket occupies its own cache line, preventing false sharing between buckets.
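We can verify the layout with std::mem. The sketch below assumes a 64-byte cache line, which is typical for x86; some CPUs prefetch in 128-byte pairs, so production code sometimes pads further:

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicPtr;

// With #[repr(align(64))], the compiler rounds the struct's size up
// to a multiple of its alignment, so one instance never straddles a
// cache line with its neighbor.
#[repr(align(64))]
struct Padded {
    head: AtomicPtr<u8>,
}

fn main() {
    assert_eq!(align_of::<Padded>(), 64);
    assert_eq!(size_of::<Padded>() % 64, 0);
    // Consecutive array elements therefore start on distinct cache lines.
    let arr = [
        Padded { head: AtomicPtr::new(std::ptr::null_mut()) },
        Padded { head: AtomicPtr::new(std::ptr::null_mut()) },
    ];
    let a = &arr[0] as *const _ as usize;
    let b = &arr[1] as *const _ as usize;
    assert_eq!(b - a, size_of::<Padded>());
}
```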

As we delve deeper into lock-free programming, we encounter more advanced concepts like helping and elimination. Helping involves threads assisting each other to complete operations, which can improve overall system throughput. Elimination, often used in stack implementations, allows push and pop operations to “cancel out” without accessing the main data structure.

Here’s a sketch of how elimination might work in a stack:

use std::sync::atomic::{AtomicPtr, AtomicBool, Ordering};
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;
use std::thread;

struct EliminationArray<T> {
    slots: Box<[EliminationSlot<T>]>,
}

struct EliminationSlot<T> {
    value: AtomicPtr<T>,
    busy: AtomicBool,
}

impl<T> EliminationArray<T> {
    fn try_eliminate(&self, value: *mut T, is_push: bool) -> Option<*mut T> {
        // ThreadId::as_u64 is unstable, so hash the thread id to pick a slot.
        let mut hasher = DefaultHasher::new();
        thread::current().id().hash(&mut hasher);
        let slot = &self.slots[(hasher.finish() as usize) % self.slots.len()];
        if slot.busy.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed).is_ok() {
            if is_push {
                slot.value.store(value, Ordering::Release);
                thread::sleep(Duration::from_micros(1));
                match slot.value.compare_exchange(value, std::ptr::null_mut(), Ordering::AcqRel, Ordering::Relaxed) {
                    Ok(_) => {
                        slot.busy.store(false, Ordering::Release);
                        None
                    }
                    Err(_) => {
                        let result = slot.value.swap(std::ptr::null_mut(), Ordering::AcqRel);
                        slot.busy.store(false, Ordering::Release);
                        Some(result)
                    }
                }
            } else {
                thread::sleep(Duration::from_micros(1));
                let result = slot.value.swap(std::ptr::null_mut(), Ordering::AcqRel);
                slot.busy.store(false, Ordering::Release);
                if result.is_null() {
                    None
                } else {
                    Some(result)
                }
            }
        } else {
            None
        }
    }
}

This elimination array allows push and pop operations to potentially complete without touching the main stack, reducing contention on the stack’s head.

As we push the boundaries of concurrent programming, it’s important to remember that with great power comes great responsibility. Lock-free programming can lead to significant performance improvements, but it also introduces complexity and potential for subtle bugs. Proper testing, including stress testing and use of tools like ThreadSanitizer, is crucial.

In conclusion, Rust’s advanced atomics provide a powerful toolkit for creating high-performance, lock-free data structures. By mastering these techniques, we can craft concurrent code that pushes the limits of what’s possible in systems programming. Whether you’re building a high-frequency trading system, a real-time game engine, or just trying to squeeze every last bit of performance out of your multi-core processor, these lock-free techniques can make a real difference. So go forth, experiment, and may your code be forever race-condition free!
