**High-Frequency Trading: 8 Zero-Copy Serialization Techniques for Nanosecond Performance in Rust**

rust

High-Frequency Trading: 8 Zero-Copy Serialization Techniques for Nanosecond Performance in Rust

Learn 8 advanced zero-copy serialization techniques for high-frequency trading: memory alignment, fixed-point arithmetic, SIMD operations & more in Rust. Reduce latency to nanoseconds.

Jun 4, 2025

**High-Frequency Trading: 8 Zero-Copy Serialization Techniques for Nanosecond Performance in Rust**

Memory-aligned message structures form the foundation of efficient zero-copy serialization in high-frequency trading systems. When I first encountered latency issues in trading applications, the solution became clear: data alignment directly impacts CPU cache performance and memory access patterns.

The key lies in structuring messages to match hardware cache lines and memory boundaries. Modern CPUs typically use 64-byte cache lines, so aligning message structures to these boundaries prevents cache line splits and reduces memory access latency.

#[repr(C, align(64))]
struct MarketDataMessage {
    header: MessageHeader,
    symbol: [u8; 8],
    price: u64,
    quantity: u32,
    timestamp: u64,
    _padding: [u8; 12],
}

#[repr(C)]
struct MessageHeader {
    message_type: u8,
    sequence: u32,
    length: u16,
    checksum: u8,
}

impl MarketDataMessage {
    fn as_bytes(&self) -> &[u8] {
        unsafe {
            std::slice::from_raw_parts(
                self as *const Self as *const u8,
                std::mem::size_of::<Self>()
            )
        }
    }
    
    fn from_bytes(data: &[u8]) -> Option<&Self> {
        if data.len() < std::mem::size_of::<Self>() {
            return None;
        }
        
        let ptr = data.as_ptr() as *const Self;
        if ptr as usize % std::mem::align_of::<Self>() != 0 {
            return None;
        }
        
        Some(unsafe { &*ptr })
    }
}

This approach eliminates the overhead of traditional serialization libraries by treating structs as raw byte arrays. The alignment guarantee ensures optimal memory access patterns, while the safety checks in from_bytes prevent undefined behavior from misaligned data.

Fixed-point arithmetic represents another crucial optimization for trading systems. Floating-point operations introduce non-deterministic behavior and performance overhead that high-frequency trading cannot tolerate. Converting prices to fixed-point representation during serialization eliminates these issues.

#[derive(Copy, Clone)]
struct FixedPrice {
    value: i64,
}

impl FixedPrice {
    const SCALE: i64 = 1_000_000;
    
    fn from_f64(price: f64) -> Self {
        Self {
            value: (price * Self::SCALE as f64) as i64,
        }
    }
    
    fn to_f64(self) -> f64 {
        self.value as f64 / Self::SCALE as f64
    }
    
    fn serialize(&self, buffer: &mut [u8]) -> usize {
        let bytes = self.value.to_le_bytes();
        buffer[..8].copy_from_slice(&bytes);
        8
    }
    
    fn deserialize(buffer: &[u8]) -> (Self, usize) {
        let value = i64::from_le_bytes([
            buffer[0], buffer[1], buffer[2], buffer[3],
            buffer[4], buffer[5], buffer[6], buffer[7]
        ]);
        
        (Self { value }, 8)
    }
}

struct OrderMessage {
    order_id: u64,
    price: FixedPrice,
    quantity: u32,
    side: OrderSide,
}

#[repr(u8)]
enum OrderSide {
    Buy = 0,
    Sell = 1,
}

The fixed-point approach maintains precision while enabling predictable serialization performance. Each price conversion becomes a simple integer operation, and the serialization process involves direct byte copying without floating-point calculations.

Ring buffer message passing eliminates the synchronization overhead that traditional queues introduce. When building trading systems, I discovered that lock-free data structures significantly reduce tail latency compared to mutex-based alternatives.

use std::sync::atomic::{AtomicUsize, Ordering};

struct RingBuffer<T> {
    buffer: Vec<std::mem::MaybeUninit<T>>,
    capacity: usize,
    write_pos: AtomicUsize,
    read_pos: AtomicUsize,
}

impl<T> RingBuffer<T> {
    fn new(capacity: usize) -> Self {
        let mut buffer = Vec::with_capacity(capacity);
        buffer.resize_with(capacity, || std::mem::MaybeUninit::uninit());
        
        Self {
            buffer,
            capacity,
            write_pos: AtomicUsize::new(0),
            read_pos: AtomicUsize::new(0),
        }
    }
    
    fn try_write(&self, item: T) -> Result<(), T> {
        let write_pos = self.write_pos.load(Ordering::Relaxed);
        let next_write = (write_pos + 1) % self.capacity;
        let read_pos = self.read_pos.load(Ordering::Acquire);
        
        if next_write == read_pos {
            return Err(item);
        }
        
        unsafe {
            let slot = &self.buffer[write_pos] as *const _ as *mut std::mem::MaybeUninit<T>;
            (*slot).write(item);
        }
        
        self.write_pos.store(next_write, Ordering::Release);
        Ok(())
    }
    
    fn try_read(&self) -> Option<T> {
        let read_pos = self.read_pos.load(Ordering::Relaxed);
        let write_pos = self.write_pos.load(Ordering::Acquire);
        
        if read_pos == write_pos {
            return None;
        }
        
        let item = unsafe {
            let slot = &self.buffer[read_pos] as *const _ as *mut std::mem::MaybeUninit<T>;
            (*slot).assume_init_read()
        };
        
        let next_read = (read_pos + 1) % self.capacity;
        self.read_pos.store(next_read, Ordering::Release);
        
        Some(item)
    }
}

This implementation uses memory ordering constraints to maintain consistency without locks. The acquire-release semantics ensure that writes become visible to readers at the correct time, while the relaxed ordering for position loads reduces CPU overhead.

Template-based message encoding leverages Rust’s const generics to generate optimized serialization code at compile time. This technique eliminates runtime branching and enables aggressive compiler optimizations.

trait FieldEncoder<const SIZE: usize> {
    fn encode(&self, buffer: &mut [u8; SIZE]) -> usize;
    fn decode(buffer: &[u8; SIZE]) -> (Self, usize) where Self: Sized;
}

impl FieldEncoder<8> for u64 {
    fn encode(&self, buffer: &mut [u8; 8]) -> usize {
        *buffer = self.to_le_bytes();
        8
    }
    
    fn decode(buffer: &[u8; 8]) -> (Self, usize) {
        (u64::from_le_bytes(*buffer), 8)
    }
}

impl FieldEncoder<4> for u32 {
    fn encode(&self, buffer: &mut [u8; 4]) -> usize {
        *buffer = self.to_le_bytes();
        4
    }
    
    fn decode(buffer: &[u8; 4]) -> (Self, usize) {
        (u32::from_le_bytes(*buffer), 4)
    }
}

struct MessageBuilder<const SIZE: usize> {
    buffer: [u8; SIZE],
    position: usize,
}

impl<const SIZE: usize> MessageBuilder<SIZE> {
    fn new() -> Self {
        Self {
            buffer: [0; SIZE],
            position: 0,
        }
    }
    
    fn add_field<T, const FIELD_SIZE: usize>(&mut self, value: &T) -> &mut Self 
    where T: FieldEncoder<FIELD_SIZE> {
        let mut field_buffer = [0u8; FIELD_SIZE];
        let size = value.encode(&mut field_buffer);
        
        self.buffer[self.position..self.position + size]
            .copy_from_slice(&field_buffer[..size]);
        self.position += size;
        
        self
    }
    
    fn finish(self) -> [u8; SIZE] {
        self.buffer
    }
}

The const generic approach allows the compiler to generate specialized code for each field type and size combination. This eliminates the need for runtime type checking and enables the compiler to inline all operations, resulting in optimal performance.

Vectorized data transformation takes advantage of CPU SIMD instructions to process multiple values simultaneously. When dealing with large arrays of price data, SIMD operations can provide significant performance improvements.

use std::arch::x86_64::*;

unsafe fn convert_prices_to_fixed_point(
    prices: &[f64], 
    output: &mut [i64]
) {
    assert_eq!(prices.len(), output.len());
    
    let scale = _mm256_set1_pd(1_000_000.0);
    let chunks = prices.len() / 4;
    
    for i in 0..chunks {
        let base_idx = i * 4;
        
        let prices_vec = _mm256_loadu_pd(prices[base_idx..].as_ptr());
        let scaled = _mm256_mul_pd(prices_vec, scale);
        let int_vec = _mm256_cvtpd_epi64(scaled);
        
        _mm256_storeu_si256(
            output[base_idx..].as_mut_ptr() as *mut __m256i,
            int_vec
        );
    }
    
    for i in (chunks * 4)..prices.len() {
        output[i] = (prices[i] * 1_000_000.0) as i64;
    }
}

This implementation processes four double-precision values simultaneously using AVX2 instructions. The vectorized approach reduces the number of CPU cycles required for price conversions, particularly beneficial when processing market data feeds with thousands of price updates per second.

Memory-mapped message queues provide persistent, zero-copy communication between processes. This technique eliminates the overhead of system calls and memory copying that traditional IPC mechanisms introduce.

use std::sync::atomic::{AtomicUsize, Ordering};

struct MmapMessageQueue {
    mmap: memmap2::MmapMut,
    header: *mut QueueHeader,
    data_start: *mut u8,
    capacity: usize,
}

#[repr(C)]
struct QueueHeader {
    write_offset: AtomicUsize,
    read_offset: AtomicUsize,
    capacity: usize,
    magic: u64,
}

impl MmapMessageQueue {
    fn new(path: &str, size: usize) -> std::io::Result<Self> {
        let file = std::fs::OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(path)?;
            
        file.set_len(size as u64)?;
        
        let mmap = unsafe { memmap2::MmapOptions::new().map_mut(&file)? };
        
        let header = mmap.as_ptr() as *mut QueueHeader;
        let data_start = unsafe { header.add(1) as *mut u8 };
        let data_capacity = size - std::mem::size_of::<QueueHeader>();
        
        unsafe {
            if (*header).magic != 0xDEADBEEF {
                (*header).write_offset = AtomicUsize::new(0);
                (*header).read_offset = AtomicUsize::new(0);
                (*header).capacity = data_capacity;
                (*header).magic = 0xDEADBEEF;
            }
        }
        
        Ok(Self {
            mmap,
            header,
            data_start,
            capacity: data_capacity,
        })
    }
    
    fn write_message(&self, data: &[u8]) -> bool {
        let message_size = data.len() + 4;
        
        unsafe {
            let write_pos = (*self.header).write_offset.load(Ordering::Relaxed);
            let read_pos = (*self.header).read_offset.load(Ordering::Acquire);
            
            let available = if write_pos >= read_pos {
                self.capacity - write_pos + read_pos
            } else {
                read_pos - write_pos
            };
            
            if available < message_size {
                return false;
            }
            
            let len_bytes = (data.len() as u32).to_le_bytes();
            std::ptr::copy_nonoverlapping(
                len_bytes.as_ptr(),
                self.data_start.add(write_pos),
                4
            );
            
            std::ptr::copy_nonoverlapping(
                data.as_ptr(),
                self.data_start.add(write_pos + 4),
                data.len()
            );
            
            let next_write = (write_pos + message_size) % self.capacity;
            (*self.header).write_offset.store(next_write, Ordering::Release);
        }
        
        true
    }
}

The memory-mapped approach provides several advantages: persistence across process restarts, zero-copy data transfer, and reduced system call overhead. The atomic operations ensure thread safety while maintaining high performance.

Hardware timestamp integration captures precise timing information using CPU time stamp counters. Accurate timestamps are essential for regulatory compliance and performance measurement in trading systems.

struct HardwareTimer {
    frequency: u64,
    start_time: u64,
}

impl HardwareTimer {
    fn new() -> Self {
        let frequency = Self::measure_frequency();
        
        Self {
            frequency,
            start_time: Self::rdtsc(),
        }
    }
    
    fn rdtsc() -> u64 {
        unsafe {
            let mut low: u32;
            let mut high: u32;
            std::arch::asm!(
                "rdtsc",
                out("eax") low,
                out("edx") high,
            );
            ((high as u64) << 32) | (low as u64)
        }
    }
    
    fn measure_frequency() -> u64 {
        let start = Self::rdtsc();
        std::thread::sleep(std::time::Duration::from_millis(100));
        let end = Self::rdtsc();
        (end - start) * 10
    }
    
    fn timestamp_ns(&self) -> u64 {
        let cycles = Self::rdtsc() - self.start_time;
        cycles * 1_000_000_000 / self.frequency
    }
}

#[repr(C)]
struct TimestampedMessage {
    hardware_timestamp: u64,
    sequence: u64,
    message_data: [u8; 56],
}

impl TimestampedMessage {
    fn new(data: &[u8], timer: &HardwareTimer) -> Self {
        let mut message_data = [0u8; 56];
        let copy_len = data.len().min(56);
        message_data[..copy_len].copy_from_slice(&data[..copy_len]);
        
        Self {
            hardware_timestamp: timer.timestamp_ns(),
            sequence: 0,
            message_data,
        }
    }
}

The RDTSC instruction provides sub-nanosecond precision timestamps with minimal overhead. Calibrating the TSC frequency ensures accurate time measurements across different hardware platforms.

Custom compact binary protocols offer superior performance compared to general-purpose serialization libraries. By designing protocols specifically for trading message types, we can achieve optimal encoding density and parsing speed.

struct CompactSerializer {
    buffer: Vec<u8>,
}

impl CompactSerializer {
    fn new() -> Self {
        Self { buffer: Vec::with_capacity(1024) }
    }
    
    fn write_varint(&mut self, mut value: u64) {
        while value >= 0x80 {
            self.buffer.push((value as u8) | 0x80);
            value >>= 7;
        }
        self.buffer.push(value as u8);
    }
    
    fn write_fixed64(&mut self, value: u64) {
        self.buffer.extend_from_slice(&value.to_le_bytes());
    }
    
    fn write_string(&mut self, s: &str) {
        self.write_varint(s.len() as u64);
        self.buffer.extend_from_slice(s.as_bytes());
    }
    
    fn serialize_order(&mut self, order: &TradingOrder) -> &[u8] {
        self.buffer.clear();
        
        self.buffer.push(1 << 3 | 1);
        self.write_fixed64(order.id);
        
        self.buffer.push(2 << 3 | 2);
        self.write_string(&order.symbol);
        
        self.buffer.push(3 << 3 | 1);
        self.write_fixed64(order.price.value as u64);
        
        self.buffer.push(4 << 3 | 0);
        self.write_varint(order.quantity as u64);
        
        &self.buffer
    }
}

struct CompactDeserializer<'a> {
    data: &'a [u8],
    position: usize,
}

impl<'a> CompactDeserializer<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, position: 0 }
    }
    
    fn read_varint(&mut self) -> Option<u64> {
        let mut result = 0u64;
        let mut shift = 0;
        
        while self.position < self.data.len() {
            let byte = self.data[self.position];
            self.position += 1;
            
            result |= ((byte & 0x7F) as u64) << shift;
            
            if byte & 0x80 == 0 {
                return Some(result);
            }
            
            shift += 7;
            if shift >= 64 {
                return None;
            }
        }
        
        None
    }
    
    fn read_fixed64(&mut self) -> Option<u64> {
        if self.position + 8 > self.data.len() {
            return None;
        }
        
        let bytes = [
            self.data[self.position],
            self.data[self.position + 1],
            self.data[self.position + 2],
            self.data[self.position + 3],
            self.data[self.position + 4],
            self.data[self.position + 5],
            self.data[self.position + 6],
            self.data[self.position + 7],
        ];
        
        self.position += 8;
        Some(u64::from_le_bytes(bytes))
    }
}

struct TradingOrder {
    id: u64,
    symbol: String,
    price: FixedPrice,
    quantity: u32,
}

This custom protocol uses variable-length encoding for integers and fixed-length encoding for high-precision values. The protocol design prioritizes parsing speed over space efficiency, making it ideal for high-frequency trading applications where latency matters more than bandwidth.

These eight techniques work together to create a comprehensive zero-copy serialization system. Memory alignment ensures optimal cache performance, fixed-point arithmetic eliminates floating-point overhead, and lock-free data structures minimize synchronization costs. SIMD operations accelerate bulk transformations, while memory mapping provides efficient inter-process communication.

Hardware timestamp integration enables precise latency measurement and regulatory compliance. Custom protocols optimize for specific message types and usage patterns. Combined, these techniques can reduce serialization overhead from microseconds to nanoseconds, making the difference between profitable and unprofitable trading strategies.

The key insight is that high-frequency trading systems require domain-specific optimizations rather than general-purpose solutions. Each technique addresses specific performance bottlenecks that traditional serialization libraries cannot eliminate. By implementing these optimizations in Rust, we gain both the performance benefits and memory safety guarantees essential for production trading systems.