Hacker News

A bit of a tangent, but HNers often have the kind of hands-on experience that's hard to find in internet searches, so I'll ask away :)

A long time ago we had a big MySQL TokuDB database and were keen to migrate to MyRocks. But MyRocks put every table into a single big file, rather than a file per partition.

Partition-per-file layout is a big deal if you are retaining N days of data and dropping the oldest day every night. If the DB stores each partition in a separate file, it can simply delete that file. But if the DB stores all partitions in a single file, dropping a day forces it to compact your absolutely massive dataset. That was completely unworkable for us.
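The retention scheme described above can be sketched in a few lines. This is a hypothetical toy (the class and file naming are illustrative, not MySQL/MyRocks internals): one append-only file per day, so dropping an expired day is a single O(1) file delete rather than a compaction of the whole dataset.

```python
import datetime as dt
import os


class DailyPartitionStore:
    """Toy sketch: one append-only file per day.

    Retention is enforced by deleting whole files, never by
    rewriting/compacting data that is still being retained.
    """

    def __init__(self, root: str, retain_days: int):
        self.root = root
        self.retain_days = retain_days
        os.makedirs(root, exist_ok=True)

    def _path(self, day: dt.date) -> str:
        return os.path.join(self.root, day.isoformat() + ".log")

    def append(self, day: dt.date, record: str) -> None:
        # Each day's records go to that day's own partition file.
        with open(self._path(day), "a") as f:
            f.write(record + "\n")

    def enforce_retention(self, today: dt.date) -> list:
        # Drop every partition older than the retention window.
        cutoff = today - dt.timedelta(days=self.retain_days)
        dropped = []
        for name in sorted(os.listdir(self.root)):
            day = dt.date.fromisoformat(name.removesuffix(".log"))
            if day < cutoff:
                os.remove(self._path(day))  # O(1) drop: no compaction needed
                dropped.append(day)
        return dropped
```

The same idea is what MySQL's `DROP PARTITION` exploits when each partition lives in its own file: expiring a day is a metadata operation plus an unlink.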

Has this changed?



Hey Will! Joran from TigerBeetle here.

Partitioning data across files (or LSM trees) can be a remarkable win, both for data retention policies and for exploiting immutability in different workloads to reduce write amplification.

For example, in TigerBeetle, a DB that provides double-entry financial accounting primitives, our secondary indexes mutate, but half of our ingest volume (the transactions themselves) is immutable and inserted in chronological order.

We therefore designed our local storage engine as an LSM-forest, putting each key/value type in its own tree, so that mutable data wouldn't force compaction of immutable data. This turns our object tree for primary keys into essentially an append-only log.
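The LSM-forest idea above can be sketched roughly as follows. This is a minimal illustration of the routing principle, not TigerBeetle's actual engine: every key/value type gets its own tree, so churn in a mutable secondary index never touches the immutable, insert-only transfer tree (all names here are hypothetical).

```python
class Tree:
    """Toy stand-in for one LSM tree in the forest."""

    def __init__(self, immutable: bool = False):
        self.immutable = immutable
        self.data = {}

    def put(self, key, value):
        if self.immutable and key in self.data:
            # Immutable trees are insert-only: existing keys never change,
            # so compaction of this tree degenerates to appending.
            raise ValueError("immutable tree: keys are insert-only")
        self.data[key] = value


class Forest:
    """Routes each key/value type to its own tree, so each tree
    compacts on its own schedule and mutable churn cannot force
    rewrites of immutable data."""

    def __init__(self):
        self.trees = {
            # Transfers are inserted once, in chronological order.
            "transfers": Tree(immutable=True),
            # Secondary indexes mutate as accounts are updated.
            "index_debit_account": Tree(immutable=False),
        }

    def put(self, tree_name, key, value):
        self.trees[tree_name].put(key, value)
```

The design point is the separation itself: because immutable and mutable types never share a tree, rewriting the secondary index during compaction has zero write amplification cost for the transfer history.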

I did a lightning talk on this, and a few of our other LSM optimizations, at Jamie Brandon's HYTRADBOI conference last year: https://www.youtube.com/watch?v=yBBpUMR8dHw

RocksDB also allows you to do this with its concept of column families, if I am not mistaken. However, we wanted more: memory efficiency with static allocation, deterministic execution, and deterministic on-disk storage, for faster testing (think FoundationDB's simulator, but with storage fault injection) and faster recovery (thanks to smaller diffs, with less randomness in the data files being recovered), as well as an engine that could solve our storage fault model.

All details in the talk. Or ping me if you have questions.


Something doesn't make sense here: MySQL/InnoDB does put tables into files, but partitions get separate files.

MyRocks keeps a collection of files per column family, and when you drop data it can quickly expunge files that don't contain data for other tables/partitions, triggering compaction on neighbors if needed.



