Hacker News

> That is, you write fewer hardware blocks to the flash chips with a well-tuned RocksDB.

What would you consider a well-tuned RocksDB? My understanding is that, due to level-based compaction, a decent amount of write amplification is unavoidable -- i.e. for one modification to eventually end up in the bottommost level (e.g. L6), it has to be (re-)written to disk 5 or 6 times. That seems like quite a heavy amount of write amplification, but maybe my expectations are off?



Tuning RocksDB well is a very, very hard challenge, and one that I am happy not to do day to day anymore. RocksDB is very powerful, but it comes with some very sharp edges. Compaction is one of those, and all answers are likely workload-dependent.

If you are worried about write amplification, then leveled compaction is suboptimal. I would try universal compaction.

- https://github.com/facebook/rocksdb/wiki/Universal-Compactio...
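To make the leveled-vs-universal trade-off concrete, here is a back-of-envelope model. The levels and size ratio are my own illustrative assumptions, not RocksDB defaults or measured numbers:

```python
# Toy write-amplification model (assumptions: 6 levels, size ratio 10).
# Each byte is flushed once; leveled compaction then rewrites it roughly
# size_ratio times per level (merging into a level rewrites the
# overlapping data already there), while universal/tiered compaction
# rewrites it roughly once per level, trading write amplification for
# space and read amplification.
def write_amplification(levels: int, size_ratio: int, style: str) -> int:
    if style == "leveled":
        return 1 + levels * size_ratio
    if style == "universal":
        return 1 + levels
    raise ValueError(f"unknown compaction style: {style}")

print(write_amplification(6, 10, "leveled"))    # 61
print(write_amplification(6, 10, "universal"))  # 7
```

The absolute numbers are crude, but the ratio between the two styles is the point.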


All those writes happen in big blocks along with other updates, so the number of device writes per update is still relatively low. Compare that to e.g. Postgres, where a single update causes quite a few page writes that probably don't get batched up (add the new row version to the heap, update the old row version in the heap, maybe update indexes, vacuum the old tuple in the heap, maybe vacuum indexes).
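A toy model of the batching point above, under assumed numbers (4 KiB pages, 100-byte rows, and a stand-in of 3 in-place page writes per Postgres-style update -- all illustrative, not measurements):

```python
import math

PAGE = 4096   # assumed page size in bytes
ROW = 100     # assumed row size in bytes

def lsm_pages_per_update(batch_rows: int) -> float:
    # Many updates are flushed together as one contiguous sorted run,
    # so the per-update share of a device write is tiny.
    pages = math.ceil(batch_rows * ROW / PAGE)
    return pages / batch_rows

def inplace_pages_per_update() -> int:
    # Stand-in for the unbatched case: heap page for the new version,
    # heap page for the old version, plus an index page. Assumption.
    return 3

print(lsm_pages_per_update(1000))   # 0.025 pages per update
print(inplace_pages_per_update())   # 3
```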


A single durable update is the worst case for a classic B-tree with no WAL, though, and it is where the classic LSM-tree has the advantage. You can't use it to show a consistent write advantage for LSM-trees, except in applications that only do those kinds of updates.

If you write (say) 1000 small, contiguous rows in a single transaction, the B-tree wins. The B-tree writes a few full leaf pages and then updates just enough interior pages to point to them. The LSM-tree writes the rows to the first level, then later rewrites them to the second level, then the third, and so on: N times for N levels. The data also has to be read N-1 times.
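The arithmetic behind that claim, as a sketch (assumed: 100-byte rows, 4 KiB pages, a 6-level classic LSM-tree, and a small B-tree needing only a couple of interior-page updates -- all illustrative):

```python
import math

ROWS, ROW_BYTES, PAGE = 1000, 100, 4096
payload = ROWS * ROW_BYTES  # 100,000 bytes in one transaction

# B-tree: write the leaf pages once, plus a handful of interior pages.
leaf_pages = math.ceil(payload / PAGE)   # 25
interior_pages = 2                       # assumption: small tree
btree_bytes = (leaf_pages + interior_pages) * PAGE

# Classic LSM-tree: the same payload is rewritten once per level.
LEVELS = 6
lsm_bytes = payload * LEVELS

print(btree_bytes)  # 110592
print(lsm_bytes)    # 600000
```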

At (say) 100,000 contiguous rows in a single transaction, the classic LSM-tree is still poor, but a modern LSM-tree with multiple files in the bulk level -- and special handling of bulk loads that bypasses the levels and writes directly to the logical middle of the bulk level -- has I/O performance similar to the B-tree's, so it catches up. I think RocksDB has this, but you have to request it explicitly rather than it detecting when to do so automatically. It probably also has lower layout discontiguity if it works by writing to files or large contiguous zones. But if we are comparing an LSM-tree modified for bulk loads, we may as well compare against a B-tree with a similar modification, which can also use a zone strategy to reduce discontiguity on bulk loads.
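The "bypass the levels" effect can be sketched the same way. (RocksDB does expose an explicit bulk path -- building SST files with SstFileWriter and ingesting them via IngestExternalFile -- and the model below is my illustrative simplification of what that buys, not a measurement of it.)

```python
# Hedged model: on the normal write path, bulk data is rewritten once
# per level it passes through; with explicit bulk ingestion, pre-sorted
# data is written once and placed directly at its target level.
def bytes_written(payload: int, levels: int, bulk_ingest: bool) -> int:
    return payload if bulk_ingest else payload * levels

print(bytes_written(100_000, 6, bulk_ingest=False))  # 600000
print(bytes_written(100_000, 6, bulk_ingest=True))   # 100000
```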



