Hacker News

A bit of a tangent, but HNers often have the kind of hands-on experience that's hard to find in internet searches, so I'll ask away :)

A long time ago we had a big MySQL TokuDB database and were keen to migrate to MyRocks. But MyRocks put every table into a single big file, rather than a file per partition.

Partition-per-file layout is a big deal if you are retaining N days of data and dropping the oldest day every night. If the DB stores each partition in a separate file, it can simply delete that file. But if the DB stores all partitions in a single file, dropping a day forces it to compact your absolutely massive dataset. That was completely unworkable for us.
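The retention scheme described above can be sketched in a few lines. This is a hypothetical toy (the class and file naming are illustrative, not MySQL/MyRocks internals): one append-only file per day, so dropping an expired day is a single O(1) file delete rather than a compaction of the whole dataset.

```python
import datetime as dt
import os


class DailyPartitionStore:
    """Toy sketch: one append-only file per day.

    Retention is enforced by deleting whole files, never by
    rewriting/compacting data that is still being retained.
    """

    def __init__(self, root: str, retain_days: int):
        self.root = root
        self.retain_days = retain_days
        os.makedirs(root, exist_ok=True)

    def _path(self, day: dt.date) -> str:
        return os.path.join(self.root, day.isoformat() + ".log")

    def append(self, day: dt.date, record: str) -> None:
        # Each day's records go to that day's own partition file.
        with open(self._path(day), "a") as f:
            f.write(record + "\n")

    def enforce_retention(self, today: dt.date) -> list:
        # Drop every partition older than the retention window.
        cutoff = today - dt.timedelta(days=self.retain_days)
        dropped = []
        for name in sorted(os.listdir(self.root)):
            day = dt.date.fromisoformat(name.removesuffix(".log"))
            if day < cutoff:
                os.remove(self._path(day))  # O(1) drop: no compaction needed
                dropped.append(day)
        return dropped
```

The same idea is what MySQL's `DROP PARTITION` exploits when each partition lives in its own file: expiring a day is a metadata operation plus an unlink.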

Has this changed?



Hey Will! Joran from TigerBeetle here.

Partitioning data across files (or LSM trees) can be a remarkable win, both for data retention policies and for exploiting immutability in different workloads to reduce write amplification.

For example, in TigerBeetle, a DB that provides double-entry financial accounting primitives, our secondary indexes mutate, but half of our ingest volume (the transactions themselves) is immutable and inserted in chronological order.

We therefore designed our local storage engine as an LSM-forest, putting each key/value type in its own tree, so that mutable data wouldn't force compaction of immutable data. This turns our object tree for primary keys into essentially an append-only log.
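The LSM-forest idea above can be sketched roughly as follows. This is a minimal illustration of the routing principle, not TigerBeetle's actual engine: every key/value type gets its own tree, so churn in a mutable secondary index never touches the immutable, insert-only transfer tree (all names here are hypothetical).

```python
class Tree:
    """Toy stand-in for one LSM tree in the forest."""

    def __init__(self, immutable: bool = False):
        self.immutable = immutable
        self.data = {}

    def put(self, key, value):
        if self.immutable and key in self.data:
            # Immutable trees are insert-only: existing keys never change,
            # so compaction of this tree degenerates to appending.
            raise ValueError("immutable tree: keys are insert-only")
        self.data[key] = value


class Forest:
    """Routes each key/value type to its own tree, so each tree
    compacts on its own schedule and mutable churn cannot force
    rewrites of immutable data."""

    def __init__(self):
        self.trees = {
            # Transfers are inserted once, in chronological order.
            "transfers": Tree(immutable=True),
            # Secondary indexes mutate as accounts are updated.
            "index_debit_account": Tree(immutable=False),
        }

    def put(self, tree_name, key, value):
        self.trees[tree_name].put(key, value)
```

The design point is the separation itself: because immutable and mutable types never share a tree, rewriting the secondary index during compaction has zero write amplification cost for the transfer history.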

I did a lightning talk on this, and a few of our other LSM optimizations, at Jamie Brandon's HYTRADBOI conference last year: https://www.youtube.com/watch?v=yBBpUMR8dHw

RocksDB also allows you to do this with its concept of column families, if I am not mistaken. However, we wanted more: memory efficiency with static allocation, deterministic execution, and deterministic on-disk storage, for faster testing (think FoundationDB's simulator, but with storage fault injection) and faster recovery (thanks to smaller diffs, with less randomness in the data files being recovered), as well as an engine that could solve our storage fault model.

All details in the talk. Or ping me if you have questions.


Something doesn't make sense here: MySQL/InnoDB does put tables into files, but partitions get separate files.

MyRocks keeps a collection of files per column family, and when you drop data it can quickly expunge files that don't contain data for other tables/partitions, triggering compaction on neighbors if needed.



