Hacker News
Show HN: I'm building a pure-Rust reimplementation of rsync (Protocol 32) (github.com/oferchen)
3 points by oferchen 55 days ago | 2 comments
Years ago, I was tasked with a massive data migration: multiple disks, each containing over 100 million files, with a strict, non-negotiable 24-hour downtime window. Using the standard tools available at the time was a painful experience. The single-threaded file discovery crawled, and memory usage was a constant source of anxiety. I promised myself that one day, I would come back and build a tool that could actually handle that scale natively.

What started as a side project has evolved into a full systems-level undertaking. The project is oc-rsync - a complete client, server, and daemon implementation targeting rsync protocol 32, written entirely in pure Rust.

I find it incredibly ironic that I am currently shipping a data migration tool while my life is packed in suitcases, as I'm migrating to another country myself. I’ve been pushing git commits multiple times a day between packing boxes.

I want to be completely transparent upfront: I am actively working on this, and not everything is functional yet. The core delta-transfer, protocol interoperability (protocols 28-32), and daemon modes are solid, but I am still mapping out the hundreds of obscure flags and edge-cases that upstream rsync handles.
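To make the core idea concrete, here is a deliberately simplified sketch of the delta-transfer logic (illustrative only, not oc-rsync's actual code): the receiver's basis file is indexed by a weak per-block checksum, and the sender emits either "reuse block N" or literal bytes. The tiny block size, the toy checksum, and the `Op` type are all inventions for this demo; real rsync uses a rolling Adler-style sum confirmed by a strong hash.

```rust
use std::collections::HashMap;

const BLOCK: usize = 4; // tiny block size for illustration; real rsync defaults to ~700 bytes

// Simplified weak checksum (rsync's real one is a 32-bit Adler-style rolling sum).
fn weak_sum(block: &[u8]) -> u32 {
    let a: u32 = block.iter().map(|&b| b as u32).sum();
    let b: u32 = block
        .iter()
        .enumerate()
        .map(|(i, &x)| (block.len() - i) as u32 * x as u32)
        .sum();
    (a & 0xffff) | ((b & 0xffff) << 16)
}

// Instructions the sender emits: reuse a basis block, or send a literal byte.
#[derive(Debug, PartialEq)]
enum Op {
    Copy(usize),
    Literal(u8),
}

fn delta(basis: &[u8], new: &[u8]) -> Vec<Op> {
    // Receiver-side signature table: weak checksum -> block index.
    let sigs: HashMap<u32, usize> = basis
        .chunks_exact(BLOCK)
        .enumerate()
        .map(|(i, c)| (weak_sum(c), i))
        .collect();

    let mut ops = Vec::new();
    let mut i = 0;
    while i < new.len() {
        if i + BLOCK <= new.len() {
            if let Some(&idx) = sigs.get(&weak_sum(&new[i..i + BLOCK])) {
                // A real implementation confirms the match with a strong hash.
                if basis[idx * BLOCK..idx * BLOCK + BLOCK] == new[i..i + BLOCK] {
                    ops.push(Op::Copy(idx));
                    i += BLOCK;
                    continue;
                }
            }
        }
        ops.push(Op::Literal(new[i]));
        i += 1;
    }
    ops
}

fn main() {
    // "abcdefgh" is the basis; the new file inserts "XX" in the middle.
    let ops = delta(b"abcdefgh", b"abcdXXefgh");
    assert_eq!(ops[0], Op::Copy(0)); // first block reused verbatim
    assert_eq!(*ops.last().unwrap(), Op::Copy(1)); // last block reused too
}
```

Only the two literal bytes cross the wire here; everything else is a block reference, which is the entire point of the protocol.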

Rebuilding a codebase shaped by over 29 years of optimization required a modular approach (the workspace is currently split across 23 crates). A primary engineering goal was strict wire-compatibility with upstream while modernizing the internals for maximum throughput:

Pipelined Parallelism: I used Rayon to decouple filesystem traversal from data transfer. Parallelizing file list generation and checksum computation eliminates the infamous "scanning stall" on massive directories.
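The pipelining idea can be sketched with plain std threads and a channel (the project itself uses Rayon; this is just a minimal illustration of the decoupling): the scanner streams paths as it finds them, and the worker starts checksumming immediately instead of waiting for a complete file list.

```rust
use std::sync::mpsc;
use std::thread;

// Two-stage pipeline: traversal and processing overlap in time.
fn pipeline(n_files: usize) -> usize {
    let (tx, rx) = mpsc::channel::<String>();

    // Stage 1: simulate filesystem traversal emitting file names.
    let scanner = thread::spawn(move || {
        for i in 0..n_files {
            tx.send(format!("file-{i}")).unwrap();
        }
        // tx is dropped here, closing the channel and ending the worker loop
    });

    // Stage 2: process entries as they arrive -- no "scanning stall".
    let worker = thread::spawn(move || {
        let mut processed = 0;
        for _path in rx {
            // a real sender would checksum `_path` and emit deltas here
            processed += 1;
        }
        processed
    });

    scanner.join().unwrap();
    worker.join().unwrap()
}

fn main() {
    assert_eq!(pipeline(1000), 1000);
}
```

Rayon generalizes this to work-stealing across all cores, so checksum computation for large files parallelizes too rather than serializing behind a single consumer.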

Modern I/O & Zero-Copy: The engine implements io_uring (Linux 5.6+) for batched async I/O with automatic fallbacks, alongside zero-copy copy_file_range and memory-mapped I/O.
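The "automatic fallbacks" part follows a standard fallback-chain pattern, sketched below. `try_copy_file_range` here is a hypothetical stand-in for a real syscall wrapper; it always reports "unsupported" so the demo exercises the portable path.

```rust
use std::fs::File;
use std::io;

// Stand-in for the zero-copy syscall wrapper (hypothetical for this demo).
fn try_copy_file_range(_src: &File, _dst: &File) -> io::Result<u64> {
    Err(io::Error::new(
        io::ErrorKind::Unsupported,
        "copy_file_range unavailable",
    ))
}

// Try the fastest kernel path first, degrade gracefully to a buffered copy.
fn copy_with_fallback(src: &mut File, dst: &mut File) -> io::Result<u64> {
    match try_copy_file_range(src, dst) {
        Ok(n) => Ok(n), // zero-copy path: data never entered userspace
        Err(e) if e.kind() == io::ErrorKind::Unsupported => io::copy(src, dst),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let src_path = dir.join("oc_rsync_demo_src");
    let dst_path = dir.join("oc_rsync_demo_dst");
    std::fs::write(&src_path, b"hello")?;
    let mut src = File::open(&src_path)?;
    let mut dst = File::create(&dst_path)?;
    assert_eq!(copy_with_fallback(&mut src, &mut dst)?, 5);
    assert_eq!(std::fs::read(&dst_path)?, b"hello");
    Ok(())
}
```

The real chain would be longer (io_uring, then copy_file_range, then mmap, then buffered I/O), but each rung is the same shape: attempt, inspect the error kind, fall through.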

SIMD & AES-NI: I replaced the standard C FFI calls with native Rust implementations. Checksums use runtime CPU feature detection (AVX2/NEON) to accelerate the rolling hash. Because standard SSH interactions simply weren't fast enough to keep up with the I/O pipeline, I offloaded the cryptography directly to hardware-accelerated AES-NI.
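Runtime feature detection in Rust looks roughly like the sketch below. The AVX2 branch is a placeholder that delegates to the scalar version (a real kernel would use `#[target_feature(enable = "avx2")]` intrinsics); the second function demonstrates the O(1) "rolling" update that makes the checksum cheap as the window slides.

```rust
// Scalar reference implementation of the window sum.
fn sum_scalar(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

#[cfg(target_arch = "x86_64")]
fn window_sum(data: &[u8]) -> u32 {
    // Checked once at runtime; real code caches the choice in a function pointer.
    if is_x86_feature_detected!("avx2") {
        sum_scalar(data) // stand-in for a vectorized AVX2 kernel
    } else {
        sum_scalar(data)
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn window_sum(data: &[u8]) -> u32 {
    sum_scalar(data) // on aarch64 this branch would check for NEON instead
}

// The "rolling" property: sliding the window by one byte is O(1),
// independent of window size -- subtract the outgoing byte, add the incoming.
fn roll(sum: u32, outgoing: u8, incoming: u8) -> u32 {
    sum.wrapping_sub(outgoing as u32).wrapping_add(incoming as u32)
}

fn main() {
    let full = window_sum(b"abc");
    // slide the window from "abc" to "bcd": drop 'a', add 'd'
    assert_eq!(roll(full, b'a', b'd'), window_sum(b"bcd"));
}
```

SIMD accelerates the full-window computation (needed once per block boundary); the rolling update handles the byte-by-byte slide in between.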

Memory Efficiency: Moved away from legacy sorted arrays to O(1) hash-based logic for metadata comparisons, and wired up the mimalloc allocator to keep the memory profile predictable.
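The hash-based comparison amounts to the pattern below (a sketch with invented types, not oc-rsync's internals): index the receiver's file list once, then answer each "does this file need transferring?" query with a single average-O(1) hash probe instead of a binary search over a sorted array. (Wiring in mimalloc is separate and is essentially a one-line `#[global_allocator]` declaration using the `mimalloc` crate.)

```rust
use std::collections::HashMap;

// Per-file metadata used for the "quick check" (size + mtime), as in rsync.
#[derive(Clone, PartialEq, Debug)]
struct Meta {
    size: u64,
    mtime: i64,
}

// Files to transfer = files missing on the receiver or with differing metadata.
fn files_needing_transfer<'a>(
    sender: &'a [(String, Meta)],
    receiver: &[(String, Meta)],
) -> Vec<&'a str> {
    // Build the index once: O(n) setup, O(1) average lookup per query.
    let index: HashMap<&str, &Meta> =
        receiver.iter().map(|(p, m)| (p.as_str(), m)).collect();
    sender
        .iter()
        .filter(|(p, m)| index.get(p.as_str()) != Some(&m))
        .map(|(p, _)| p.as_str())
        .collect()
}

fn main() {
    let sender: Vec<(String, Meta)> = vec![
        ("a.txt".into(), Meta { size: 1, mtime: 10 }),
        ("b.txt".into(), Meta { size: 2, mtime: 20 }),
        ("c.txt".into(), Meta { size: 3, mtime: 30 }),
    ];
    let receiver: Vec<(String, Meta)> = vec![
        ("a.txt".into(), Meta { size: 1, mtime: 10 }), // unchanged
        ("b.txt".into(), Meta { size: 2, mtime: 99 }), // mtime differs
    ];
    assert_eq!(
        files_needing_transfer(&sender, &receiver),
        vec!["b.txt", "c.txt"]
    );
}
```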

I won't commit to specific "X times faster" claims here, as performance depends heavily on your hardware and file distribution. Under heavy transfer workloads, however, this architecture consistently matches or beats traditional builds, with significantly lower CPU utilization. You don't need to set up benchmark scripts yourself to verify this: my CI pipeline benchmarks every release automatically and posts an image of the results directly to the README.

If you are interested in systems programming, kernel bypass I/O, or Rust workspace architecture, I'd love for you to take a look at the code. Let me know what you think of the architecture, or if you spot any glaring filesystem edge cases I should add to my CI harness.

#rust #Rust_Israel #ראסט_ישראל #rsync #systemsprogramming #performance #simd #aes #aes-ni



Why don't you work on improving rsync, rather than reinventing the wheel? Or create something new?

It's fine as a personal project, but as soon as you get other people using your new code, they'll be exposed to all the bugs that you are inevitably creating.

Honestly, this kind of "rewrite something battle tested in my favourite language" project is dangerous and insane.


I understand the apprehension. It’s completely fair to be wary of tossing out decades of battle-tested reliability, and "Rewrite It In X" syndrome is definitely a real anti-pattern in our industry. However, calling this project "dangerous and insane" ignores some very real, systemic limitations of the current rsync implementation that simply cannot be fixed without a ground-up rewrite.

Here is why creating a modern alternative is a necessary step forward, rather than just reinventing the wheel:

- Memory Safety and Security: rsync is written in C. While the maintainers do an incredible job, a memory-unsafe language will inevitably suffer from security vulnerabilities. We have seen periodic, significant CVEs related to buffer overflows and out-of-bounds memory accesses. Moving to a modern, memory-safe language structurally eliminates entire classes of these vulnerabilities.

- Architectural Bottlenecks: rsync was designed for 1990s hardware. While it uses a multi-process pipeline (generator, sender, receiver), its core file-processing loop is fundamentally serial. It was optimized for an era of low RAM and spinning disks. Today, we have NVMe drives and processors with dozens of cores. Modernizing the architecture with true multithreading and asynchronous I/O allows for massive performance gains that rsync's legacy architecture simply cannot accommodate.

- Technical Debt and Spaghetti Code: The original codebase is approaching 30 years old (first released in 1996). Over those decades it has accumulated a massive amount of technical debt. It is poorly documented internally, heavily patched, and relies on capturing thousands of obscure edge cases directly in the code logic rather than through clean abstractions. It has become a black box that is incredibly hostile to new contributors.

- Protocol Documentation: Because so much of the tool's behavior is implicitly defined by the code itself, the actual modern network protocol lacks comprehensive, standalone documentation. A major goal of this rewrite is to hunt down those undocumented edge cases and finally establish a clear, documented standard for the protocol.

- The Reality of Bugs: Yes, writing new code introduces new bugs. But if the fear of new bugs prevented us from writing new software, we would all still be using FTP and Telnet. Bugs are an inherent part of the development and iteration process. We write tests, we run betas, and we fix them.

The goal isn't to force everyone to switch tomorrow. It’s to build a modern, fast, and safe foundation so that when people are ready for an alternative, a robust one exists.



