# Rust for System Tools: Building Cleanser
When I decided to build a macOS storage cleanup tool, I knew I needed something fast, safe, and reliable. Rust was the obvious choice, and the journey taught me a lot about systems programming and performance optimization.
## Why Rust for CLI Tools?
### Memory Safety Without Garbage Collection
System tools often need to process large amounts of data efficiently. Rust gives you:

- Zero-cost abstractions
- Memory safety without runtime overhead
- Predictable performance characteristics

### Excellent Concurrency Primitives
For a tool that scans filesystems, parallel processing is essential:

```rust
use rayon::prelude::*;
use std::path::PathBuf;

fn scan_directories(paths: Vec<PathBuf>) -> Vec<FileInfo> {
    paths
        .par_iter()
        .flat_map(|path| scan_directory(path))
        .collect()
}
```
### Cross-Platform by Default
Rust's standard library abstracts platform differences well, making cross-platform development straightforward.

## Architecture Decisions
### Parallel File Scanning
The biggest performance win came from parallelizing filesystem operations:

```rust
use rayon::{ThreadPool, ThreadPoolBuilder};
use std::path::Path;

pub struct Scanner {
    thread_pool: ThreadPool,
    max_depth: usize,
}

impl Scanner {
    pub fn new(threads: usize, max_depth: usize) -> Self {
        let thread_pool = ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .expect("Failed to create thread pool");
        Self { thread_pool, max_depth }
    }

    pub fn scan(&self, root: &Path) -> ScanResult {
        // Run the recursive scan inside the dedicated pool, so parallelism
        // is bounded by the configured thread count.
        self.thread_pool.install(|| self.scan_recursive(root, 0))
    }
}
```
### Smart Caching System
To avoid rescanning unchanged directories:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

#[derive(Debug, Clone)]
struct CacheEntry {
    last_modified: SystemTime,
    file_count: usize,
    total_size: u64,
}

pub struct ScanCache {
    entries: HashMap<PathBuf, CacheEntry>,
}

impl ScanCache {
    /// A cached entry is still valid if the directory has not been
    /// modified since the entry was recorded.
    pub fn is_valid(&self, path: &Path) -> bool {
        if let Some(entry) = self.entries.get(path) {
            if let Ok(metadata) = path.metadata() {
                if let Ok(modified) = metadata.modified() {
                    return modified <= entry.last_modified;
                }
            }
        }
        false
    }
}
```
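The cache also has to be refreshed after a successful scan. Here is a minimal, std-only sketch of what that might look like; the `update` method and the demo in `main` are my own illustration, not Cleanser's actual API:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

#[derive(Debug, Clone)]
struct CacheEntry {
    last_modified: SystemTime,
    file_count: usize,
    total_size: u64,
}

#[derive(Default)]
pub struct ScanCache {
    entries: HashMap<PathBuf, CacheEntry>,
}

impl ScanCache {
    /// Record fresh scan results for a directory, replacing any stale entry.
    /// (Hypothetical helper; name and signature are assumptions.)
    pub fn update(&mut self, path: &Path, file_count: usize, total_size: u64) {
        let last_modified = path
            .metadata()
            .and_then(|m| m.modified())
            .unwrap_or(SystemTime::UNIX_EPOCH);
        self.entries.insert(
            path.to_owned(),
            CacheEntry { last_modified, file_count, total_size },
        );
    }

    pub fn is_valid(&self, path: &Path) -> bool {
        match (self.entries.get(path), path.metadata().and_then(|m| m.modified())) {
            (Some(entry), Ok(modified)) => modified <= entry.last_modified,
            _ => false,
        }
    }
}

fn main() {
    let mut cache = ScanCache::default();
    let tmp = std::env::temp_dir();
    assert!(!cache.is_valid(&tmp)); // nothing cached yet
    cache.update(&tmp, 42, 1024);
    assert!(cache.is_valid(&tmp)); // entry is now fresh
    println!("cache ok");
}
```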
### Risk-Based Cleanup Levels
Different users have different risk tolerances:

```rust
#[derive(Debug, Clone, Copy)]
pub enum CleanupLevel {
    Safe,     // Only obvious temp files
    Moderate, // Include build artifacts
    Risky,    // Include caches that might slow things down
}

impl CleanupLevel {
    pub fn should_clean(&self, file_type: &FileType) -> bool {
        match (self, file_type) {
            (_, FileType::TempFile) => true,
            (CleanupLevel::Moderate | CleanupLevel::Risky, FileType::BuildArtifact) => true,
            (CleanupLevel::Risky, FileType::Cache) => true,
            _ => false,
        }
    }
}
```
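To make the decision matrix concrete, here is a self-contained version with a stand-in `FileType`. The variant set is my guess from the match arms above (plus a `UserDocument` variant to show the fall-through case); Cleanser's real enum likely differs:

```rust
/// Hypothetical file classification, inferred from the match arms above.
#[derive(Debug, Clone, Copy)]
pub enum FileType {
    TempFile,
    BuildArtifact,
    Cache,
    UserDocument,
}

#[derive(Debug, Clone, Copy)]
pub enum CleanupLevel {
    Safe,
    Moderate,
    Risky,
}

impl CleanupLevel {
    pub fn should_clean(&self, file_type: &FileType) -> bool {
        match (self, file_type) {
            (_, FileType::TempFile) => true,
            (CleanupLevel::Moderate | CleanupLevel::Risky, FileType::BuildArtifact) => true,
            (CleanupLevel::Risky, FileType::Cache) => true,
            _ => false, // anything unrecognized is never cleaned
        }
    }
}

fn main() {
    // Every level cleans temp files; only escalating levels touch more.
    assert!(CleanupLevel::Safe.should_clean(&FileType::TempFile));
    assert!(!CleanupLevel::Safe.should_clean(&FileType::BuildArtifact));
    assert!(CleanupLevel::Moderate.should_clean(&FileType::BuildArtifact));
    assert!(!CleanupLevel::Moderate.should_clean(&FileType::Cache));
    assert!(CleanupLevel::Risky.should_clean(&FileType::Cache));
    assert!(!CleanupLevel::Risky.should_clean(&FileType::UserDocument));
    println!("all level checks pass");
}
```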
## Performance Optimizations
### SHA-256 Hashing for Duplicates
Finding duplicate files efficiently:

```rust
use rayon::prelude::*;
use sha2::{Digest, Sha256};
use std::collections::HashMap;
use std::fs::File;
use std::path::{Path, PathBuf};

pub fn find_duplicates(files: &[PathBuf]) -> HashMap<String, Vec<PathBuf>> {
    files
        .par_iter()
        // Hash files in parallel, silently skipping any that fail to open.
        .filter_map(|path| {
            let hash = compute_file_hash(path).ok()?;
            Some((hash, path.clone()))
        })
        .collect::<Vec<_>>()
        .into_iter()
        // Group paths by hash...
        .fold(HashMap::new(), |mut acc, (hash, path)| {
            acc.entry(hash).or_insert_with(Vec::new).push(path);
            acc
        })
        .into_iter()
        // ...and keep only hashes shared by more than one file.
        .filter(|(_, paths)| paths.len() > 1)
        .collect()
}

fn compute_file_hash(path: &Path) -> Result<String, Box<dyn std::error::Error>> {
    let mut file = File::open(path)?;
    let mut hasher = Sha256::new();
    // Stream the file through the hasher instead of reading it into memory.
    std::io::copy(&mut file, &mut hasher)?;
    Ok(format!("{:x}", hasher.finalize()))
}
```
### Memory-Efficient File Processing
For large files, stream processing avoids loading everything into memory:

```rust
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

fn process_large_file(path: &Path) -> Result<ProcessResult, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let mut reader = BufReader::new(file);
    let mut buffer = [0u8; 8192]; // 8 KB buffer
    let mut total_size = 0;

    loop {
        let bytes_read = reader.read(&mut buffer)?;
        if bytes_read == 0 {
            break; // EOF
        }
        total_size += bytes_read;
        // Process the chunk here without storing the entire file
    }

    Ok(ProcessResult { size: total_size })
}
```
## Safety Features
### Validation Before Deletion
Never delete without confirmation:

```rust
pub struct SafeDeleter {
    dry_run: bool,
    require_confirmation: bool,
}

impl SafeDeleter {
    pub fn delete_files(&self, files: &[PathBuf]) -> Result<DeletionResult, DeletionError> {
        // Validate that all files exist and are safe to delete
        self.validate_files(files)?;

        if self.require_confirmation {
            self.prompt_for_confirmation(files)?;
        }

        if self.dry_run {
            return Ok(DeletionResult::dry_run(files));
        }

        // Actually delete the files
        self.perform_deletion(files)
    }

    fn validate_files(&self, files: &[PathBuf]) -> Result<(), DeletionError> {
        for file in files {
            if self.is_system_critical(file) {
                return Err(DeletionError::SystemCritical(file.clone()));
            }
        }
        Ok(())
    }
}
```
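The `is_system_critical` check is the last line of defense. Cleanser's real check is surely more thorough, but prefix matching against a list of protected roots captures the basic idea; this sketch (protected paths and all) is my own illustration:

```rust
use std::path::Path;

/// Hypothetical guard: refuse to touch anything under a protected root.
/// `Path::starts_with` matches whole path components, so "/Library" does
/// not accidentally match "/Users/me/Library".
fn is_system_critical(path: &Path) -> bool {
    const PROTECTED: &[&str] = &["/System", "/usr", "/bin", "/sbin", "/Library"];
    PROTECTED.iter().any(|root| path.starts_with(root))
}

fn main() {
    assert!(is_system_critical(Path::new("/System/Library/CoreServices")));
    assert!(is_system_critical(Path::new("/usr/bin/env")));
    // A per-user cache directory is fair game.
    assert!(!is_system_critical(Path::new("/Users/me/Library/Caches/foo")));
    println!("guard ok");
}
```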
## Distribution Strategy
### Homebrew Integration
Making installation easy for macOS users:

```ruby
# Formula/cleanser.rb
class Cleanser < Formula
  desc "High-performance macOS storage cleanup tool"
  homepage "https://github.com/phpfc/cleanser"
  url "https://github.com/phpfc/cleanser/archive/v1.0.0.tar.gz"
  sha256 "..."

  depends_on "rust" => :build

  def install
    system "cargo", "install", *std_cargo_args
  end

  test do
    system "#{bin}/cleanser", "--version"
  end
end
```
### CI/CD Pipeline
Automated testing and releases:

```yaml
name: Release

on:
  push:
    tags: ['v*']

jobs:
  build:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
      - name: Build release
        run: cargo build --release
      - name: Run tests
        run: cargo test
      - name: Create release
        uses: actions/create-release@v1
```
## Lessons Learned
### 1. Start Simple, Optimize Later
My first version was single-threaded and slow. Profiling showed where the bottlenecks were, and I then optimized those specific areas.

### 2. Error Handling Is Critical

System tools need robust error handling. Rust's `Result` type makes this natural:
```rust
fn scan_directory(path: &Path) -> Result<Vec<FileInfo>, ScanError> {
    let entries = fs::read_dir(path)
        .map_err(|e| ScanError::ReadDir { path: path.to_owned(), source: e })?;
    // Process entries...
}
```
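For completeness, here is one plausible shape for that `ScanError` type. This is a sketch, not Cleanser's actual definition; in practice the `thiserror` crate would remove most of this boilerplate:

```rust
use std::fmt;
use std::io;
use std::path::PathBuf;

/// Hypothetical error type matching the `ScanError::ReadDir` constructor
/// used in the snippet above.
#[derive(Debug)]
pub enum ScanError {
    ReadDir { path: PathBuf, source: io::Error },
}

impl fmt::Display for ScanError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ScanError::ReadDir { path, source } => {
                write!(f, "failed to read directory {}: {}", path.display(), source)
            }
        }
    }
}

impl std::error::Error for ScanError {
    // Expose the underlying io::Error so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            ScanError::ReadDir { source, .. } => Some(source),
        }
    }
}

fn main() {
    let err = ScanError::ReadDir {
        path: PathBuf::from("/no/such/dir"),
        source: io::Error::new(io::ErrorKind::NotFound, "not found"),
    };
    assert!(err.to_string().contains("/no/such/dir"));
    println!("{err}");
}
```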
### 3. User Experience Matters

Even CLI tools need good UX. Clear progress indicators, helpful error messages, and sensible defaults make all the difference.

### 4. Performance Testing Is Essential

I used `criterion` for benchmarking and `flamegraph` for profiling. Measuring performance objectively was crucial for optimization decisions.
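`criterion` is the right tool for statistically stable numbers, but even a std-only timing harness catches order-of-magnitude regressions during development. A sketch of my own (the `count_entries` function is a stand-in for the real scan being timed):

```rust
use std::fs;
use std::path::Path;
use std::time::Instant;

/// Count directory entries one level deep -- a stand-in for the real scan.
fn count_entries(dir: &Path) -> usize {
    fs::read_dir(dir).map(|it| it.count()).unwrap_or(0)
}

fn main() {
    let dir = std::env::temp_dir();
    let start = Instant::now();
    let mut total = 0;
    // Repeat the operation so the measurement isn't dominated by noise.
    for _ in 0..100 {
        total += count_entries(&dir);
    }
    let elapsed = start.elapsed();
    println!("100 scans, {} entries total, in {:?}", total, elapsed);
}
```

This is only a sanity check; for real decisions, `criterion`'s warm-up and outlier handling matter.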
## Results
The final tool achieved:

- 10x faster than equivalent Python scripts
- Memory usage under 50 MB, even for large scans
- Zero crashes in production use
- 1000+ downloads via Homebrew
## Conclusion
Rust proved to be an excellent choice for building system tools. The combination of performance, safety, and excellent tooling made development productive and the result reliable.
The key takeaways:

1. Leverage Rust's concurrency primitives early
2. Design for safety from the beginning
3. Profile and optimize based on real usage
4. Invest in good distribution and CI/CD
Building Cleanser taught me that Rust isn't just for systems programming; it's a great fit for any tool where performance and reliability matter.