As someone who has spent years working with data in various programming languages, I’ve come to appreciate the unique strengths that Rust brings to data science. Its emphasis on memory safety and performance makes it an ideal candidate for handling large-scale data tasks where errors can be costly and speed is essential. When I first explored Rust for data work, I was skeptical about moving away from Python’s rich ecosystem. However, the growing collection of Rust libraries has convinced me that it’s not just viable but often superior for many data science applications. In this article, I’ll walk through eight Rust libraries that have become staples in my toolkit, each offering robust solutions for different aspects of data science.
Polars stands out as my go-to library for data manipulation. It provides data frame operations that rival popular tools like pandas in Python, but with the added benefit of Rust’s efficient memory management. I’ve used it to process datasets that would have strained other systems, thanks to its lazy evaluation and parallel execution. For instance, when working with multi-gigabyte CSV files, Polars allows me to filter and transform data without loading everything into memory at once. Here’s a practical example from a recent project where I needed to clean and aggregate sales data. The code lazily scans a CSV, filters out low-value rows, and sums amounts by region, all while minimizing resource usage.
use polars::prelude::*;

fn process_large_dataset() -> Result<DataFrame, PolarsError> {
    // Build a lazy query; nothing is read or computed until collect() runs.
    let df = LazyFrame::scan_csv("sales_data.csv", Default::default())?
        // Keep only high-value transactions.
        .filter(col("amount").gt(lit(1000)))
        // Total the amounts per region.
        .group_by(["region"])
        .agg([col("amount").sum()])
        .collect()?;
    Ok(df)
}
This approach saved me hours of processing time compared to traditional methods. Polars also integrates smoothly with other data formats, like Parquet, which I often use for columnar storage. The ability to chain operations lazily means I can build complex pipelines and only execute them when necessary, reducing overhead and improving responsiveness in interactive analyses.
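To make that concrete, here’s a minimal sketch of the kind of Parquet-backed lazy pipeline I mean; the file and column names are placeholders, and the exact scan helpers vary slightly between Polars versions.
use polars::prelude::*;

// Nothing is read or computed until collect(); file and column names are placeholders.
fn summarize_events() -> Result<DataFrame, PolarsError> {
    LazyFrame::scan_parquet("events.parquet", ScanArgsParquet::default())?
        .filter(col("duration_ms").gt(lit(250)))
        .group_by(["endpoint"])
        .agg([col("duration_ms").mean().alias("avg_duration_ms")])
        .collect()
}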
Ndarray is another library I rely on for numerical computing. It offers n-dimensional arrays that feel familiar if you’ve used NumPy, but with Rust’s compile-time checks to prevent common errors like shape mismatches. In one project, I used Ndarray to implement custom mathematical models for financial forecasting. The library’s slicing and broadcasting capabilities made it easy to work with high-dimensional data. Here’s a snippet where I performed element-wise operations and matrix multiplication, which are fundamental to many numerical tasks.
use ndarray::Array2;

fn compute_weighted_scores() -> Array2<f64> {
    // A 3x3 correlation matrix and a column vector of weights.
    let data = Array2::from_shape_vec((3, 3), vec![1.0, 0.5, 0.2, 0.5, 1.0, 0.3, 0.2, 0.3, 1.0]).unwrap();
    let weights = Array2::from_shape_vec((3, 1), vec![0.4, 0.3, 0.3]).unwrap();
    // Element-wise scaling followed by matrix multiplication.
    let scaled = &data * 0.5;
    scaled.dot(&weights)
}
I’ve found Ndarray particularly useful when paired with linear algebra crates for more advanced computations. Its performance in iterative algorithms, such as gradient descent, has helped me achieve results faster than in interpreted languages. The type safety ensures that I catch errors early, which is crucial when deploying models to production.
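To show what I mean by iterative algorithms like gradient descent, here’s a stripped-down batch version for a least-squares fit; the learning rate and epoch count are arbitrary stand-ins rather than tuned values.
use ndarray::{Array1, Array2};

// Batch gradient descent for least squares: w <- w - lr * dL/dw.
fn gradient_descent(x: &Array2<f64>, y: &Array1<f64>, lr: f64, epochs: usize) -> Array1<f64> {
    let n = x.nrows() as f64;
    let mut w = Array1::<f64>::zeros(x.ncols());
    for _ in 0..epochs {
        // Residuals of the current fit, shape (n,).
        let residuals = x.dot(&w) - y;
        // Gradient of the mean squared error with respect to the weights, shape (p,).
        let grad = x.t().dot(&residuals) * (2.0 / n);
        w = &w - &(grad * lr);
    }
    w
}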
Linfa has become my preferred choice for machine learning tasks. It provides a comprehensive set of algorithms for classification, regression, and clustering, all designed with usability in mind. I appreciate its modular approach, which lets me swap components easily during experimentation. For example, when building a spam detection system, I used Linfa’s logistic regression implementation. The code below shows how straightforward it is to train a model and make predictions.
use linfa::prelude::*;
use linfa_logistic::LogisticRegression;
use ndarray::{Array1, Array2};

fn train_and_evaluate(features: Array2<f64>, labels: Array1<usize>) -> Result<(), Box<dyn std::error::Error>> {
    // Bundle features and labels, then hold out 20% for evaluation.
    let dataset = Dataset::new(features, labels);
    let (train, test) = dataset.split_with_ratio(0.8);
    // Fit the classifier on the training split.
    let model = LogisticRegression::default().fit(&train)?;
    // Evaluate predictions on the held-out split.
    let predictions = model.predict(&test);
    let cm = predictions.confusion_matrix(&test)?;
    println!("Accuracy: {}", cm.accuracy());
    Ok(())
}
This library has saved me from the pitfalls of overfitting and data leakage by encouraging best practices like proper train-test splits. I’ve also used its clustering algorithms for customer segmentation, where the performance gains from Rust’s parallelism were noticeable on large datasets.
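For reference, here’s roughly how I set up that kind of segmentation with the linfa-clustering crate; the cluster count of three is a placeholder, and the exact builder methods differ a little between linfa versions.
use linfa::DatasetBase;
use linfa::traits::{Fit, Predict};
use linfa_clustering::KMeans;
use ndarray::{Array1, Array2};

// Cluster customers into three segments; the count is illustrative, not tuned.
fn segment_customers(features: Array2<f64>) -> Array1<usize> {
    // Wrap the feature matrix in a dataset with no targets.
    let observations = DatasetBase::from(features.clone());
    let model = KMeans::params(3)
        .fit(&observations)
        .expect("k-means should converge on well-formed input");
    // Assign each row to its nearest centroid.
    model.predict(&features)
}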
Candle is a relatively new addition to my arsenal, but it has quickly proven its worth for deep learning. It offers GPU-accelerated tensor computations with a clean API, making it accessible without sacrificing power. I used Candle to build a neural network for image recognition, and the ability to run on CUDA-enabled devices cut training time significantly. Here’s a basic example of creating tensors and performing operations, which mirrors what you might do in PyTorch or TensorFlow.
use candle_core::{DType, Device, Tensor};

fn simple_neural_net() -> Result<(), candle_core::Error> {
    // Fall back to the CPU when no CUDA device is present.
    let device = Device::cuda_if_available(0)?;
    // A random input batch and the parameters of a single linear layer.
    let input = Tensor::randn(0f32, 1.0, (1, 10), &device)?;
    let weight = Tensor::randn(0f32, 1.0, (10, 5), &device)?;
    let bias = Tensor::zeros((5,), DType::F32, &device)?;
    // y = xW + b, with the bias broadcast across the batch dimension.
    let output = input.matmul(&weight)?.broadcast_add(&bias)?;
    println!("Output shape: {:?}", output.shape());
    Ok(())
}
What I like about Candle is its self-contained nature; it doesn’t rely on external deep learning frameworks, which simplifies deployment. In production environments, this has made it easier to maintain and scale models without dependency conflicts.
Tch-rs bridges the gap between Rust and PyTorch, allowing me to leverage existing PyTorch models within Rust applications. This has been invaluable when migrating legacy systems or collaborating with teams that use PyTorch. I once integrated a pre-trained vision model into a Rust service for real-time inference, and Tch-rs made the process seamless. The code below demonstrates how to create a tensor, move it to the GPU when one is available, and perform a simple operation, all with syntax that mirrors PyTorch.
use tch::{Device, Tensor};

fn scale_tensor() -> Tensor {
    // Place the tensor on the GPU when one is available, otherwise the CPU.
    let t = Tensor::of_slice(&[1.0, 2.0, 3.0]).to_device(Device::cuda_if_available());
    // Element-wise multiplication, just like `t * 2.0` in PyTorch.
    t * 2.0
}
This library has helped me maintain performance while gradually transitioning codebases to Rust. The ability to use PyTorch’s autograd and optimizer implementations means I don’t have to rewrite everything from scratch, saving time and reducing errors.
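To give a flavor of that, here’s roughly what one optimizer step looks like through tch-rs; the layer sizes, batch size, and learning rate are arbitrary stand-ins.
use tch::nn::{self, Module, OptimizerConfig};
use tch::{Device, Kind, Tensor};

// One gradient step on a single linear layer; shapes and hyperparameters are illustrative.
fn training_step() -> Result<(), tch::TchError> {
    let vs = nn::VarStore::new(Device::cuda_if_available());
    let net = nn::linear(&vs.root() / "layer1", 10, 1, Default::default());
    let mut opt = nn::Adam::default().build(&vs, 1e-3)?;
    let x = Tensor::randn(&[32, 10], (Kind::Float, vs.device()));
    let y = Tensor::randn(&[32, 1], (Kind::Float, vs.device()));
    let loss = net.forward(&x).mse_loss(&y, tch::Reduction::Mean);
    // backward_step zeroes gradients, runs backward, and applies the update.
    opt.backward_step(&loss);
    Ok(())
}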
SmartCore focuses on traditional machine learning algorithms with an emphasis on correctness and efficiency. I’ve used it for projects where interpretability is key, such as credit scoring models with decision trees. Its API is intuitive, making it easy to prototype and deploy. Here’s an example of training a linear regression model, which I’ve applied in demand forecasting.
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::linear::linear_regression::LinearRegression;
use smartcore::metrics::mean_squared_error;

fn predict_sales(features: DenseMatrix<f64>, targets: Vec<f64>) -> Result<(), smartcore::error::Failed> {
    // Fit ordinary least squares with default parameters.
    let model = LinearRegression::fit(&features, &targets, Default::default())?;
    // Score on the training data; in practice I evaluate on a held-out split.
    let predictions = model.predict(&features)?;
    let mse = mean_squared_error(&targets, &predictions);
    println!("MSE: {}", mse);
    Ok(())
}
SmartCore’s implementations are well-tested, which gives me confidence in production settings. I’ve found it especially useful for applications where latency matters, such as real-time recommendation systems.
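Since I mentioned decision trees for credit scoring, here’s the shape that code usually takes; this is a sketch that assumes binary class labels encoded as 0.0 and 1.0, and the module paths shift a bit between SmartCore releases.
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::tree::decision_tree_classifier::DecisionTreeClassifier;

// Labels are 0.0 / 1.0 floats; paths assume the same SmartCore version as above.
fn score_applicants(features: DenseMatrix<f64>, labels: Vec<f64>) -> Result<Vec<f64>, smartcore::error::Failed> {
    let model = DecisionTreeClassifier::fit(&features, &labels, Default::default())?;
    // Predicted class per row of the feature matrix.
    model.predict(&features)
}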
Plotly-rs brings interactive visualization to Rust, enabling me to create charts and dashboards without switching to Python. I’ve used it to build internal tools for data exploration, where interactivity helps teams understand trends quickly. The library supports a wide range of plot types, from scatter plots to heatmaps. Here’s how I generated a line plot to visualize time series data.
use plotly::common::Mode;
use plotly::layout::Layout;
use plotly::{Plot, Scatter};

fn plot_time_series() {
    // Placeholder time-series data.
    let x = vec![1, 2, 3, 4, 5];
    let y = vec![10, 11, 12, 13, 14];
    // A line trace named "Trend".
    let trace = Scatter::new(x, y).mode(Mode::Lines).name("Trend");
    let layout = Layout::new().title("Sales Over Time".into());
    let mut plot = Plot::new();
    plot.add_trace(trace);
    plot.set_layout(layout);
    // Opens the interactive chart in the default browser.
    plot.show();
}
This has enhanced my workflow by keeping everything within Rust, reducing context switching. The plots are web-based, so they can be embedded in applications or shared easily.
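When I embed a chart instead of opening it in a browser, I write it out as HTML; a quick sketch, with the output path as a placeholder.
use plotly::{Plot, Scatter};

// Render the chart to a standalone HTML file; the path is a placeholder.
fn export_plot() {
    let trace = Scatter::new(vec![1, 2, 3], vec![4, 5, 6]);
    let mut plot = Plot::new();
    plot.add_trace(trace);
    plot.write_html("reports/sales_trend.html");
}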
Datafusion allows me to run SQL queries over in-memory Arrow data and files such as CSV and Parquet, which is perfect for ad-hoc analysis and building data pipelines. I’ve integrated it into ETL processes where SQL’s expressiveness simplifies complex transformations. For instance, in a recent log analysis project, I used Datafusion to filter and aggregate events efficiently. The async support is a bonus for non-blocking operations.
use datafusion::prelude::*;
use datafusion::arrow::record_batch::RecordBatch;

async fn analyze_logs() -> Result<Vec<RecordBatch>, datafusion::error::DataFusionError> {
    let ctx = SessionContext::new();
    // Expose the CSV file as a SQL table named "logs".
    ctx.register_csv("logs", "server_logs.csv", CsvReadOptions::new()).await?;
    // Count events per status code for the current year.
    let df = ctx
        .sql("SELECT status, COUNT(*) AS events FROM logs WHERE timestamp > '2023-01-01' GROUP BY status")
        .await?;
    let results = df.collect().await?;
    Ok(results)
}
Datafusion’s compatibility with Apache Arrow means I can exchange data with other systems seamlessly. This has been crucial in environments where data comes from multiple sources.
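Here’s a sketch of that multi-source pattern: registering a Parquet table alongside the CSV logs and joining them in one query; the table, file, and column names are placeholders.
use datafusion::prelude::*;
use datafusion::arrow::record_batch::RecordBatch;

// Join CSV logs against a Parquet dimension table in a single SQL query; names are placeholders.
async fn join_sources() -> Result<Vec<RecordBatch>, datafusion::error::DataFusionError> {
    let ctx = SessionContext::new();
    ctx.register_csv("logs", "server_logs.csv", CsvReadOptions::new()).await?;
    ctx.register_parquet("hosts", "hosts.parquet", ParquetReadOptions::default()).await?;
    let df = ctx
        .sql("SELECT h.datacenter, COUNT(*) AS events FROM logs l JOIN hosts h ON l.host_id = h.host_id GROUP BY h.datacenter")
        .await?;
    df.collect().await
}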
Using these libraries together has transformed how I approach data science in Rust. They cover the entire workflow, from data ingestion and cleaning to modeling and visualization. The performance benefits are tangible; I’ve seen reductions in processing time and memory usage compared to other languages. Moreover, Rust’s safety features have prevented many runtime errors that often plague data projects. As the ecosystem continues to mature, I expect even more tools to emerge, but these eight have already made Rust a compelling choice for data-intensive applications. Whether you’re just starting or looking to optimize existing systems, I recommend giving them a try—they might just change your perspective on what’s possible in data science.