> Shards increase the number of failure modes and increase the complexity of those failure modes.
I would only agree with this during the initial implementation of sharding. Once deployed and stable, I have not found this to be the case, at all.
I say this as someone who has directly architected a sharded database layer that scaled to over a trillion rows, and later worked on core automation and operations for sharded databases that scaled to an incalculable number of rows (easily over 1 quadrillion).
In both cases, each company's non-sharded databases were FAR more operationally problematic than the sharded ones. The sharded database tiers behave in common ways with relatively uniform workloads, and the non-sharded databases were each special snowflakes using different obscure features of the database.
> keep it simple, don't shard until you need
I would have wholeheartedly agreed with this until a few years ago. Cloud storage now permits many companies to run monster 10+ TB monolithic relational databases. Technically these companies no longer "need" to shard, possibly ever. But at these data sizes, many operations become extremely painful, problematic, and slow.
> In both cases, each company's non-sharded databases were FAR more operationally problematic than the sharded ones. The sharded database tiers behave in common ways with relatively uniform workloads, and the non-sharded databases were each special snowflakes using different obscure features of the database.
That's because sharded tables restrict what features you can use (e.g., no JOINs). If you constrained the features on the non-sharded databases, you'd achieve the same net result.
Sharding _necessarily_ solves only one problem: queries operate against only a subset of the data. While what you're saying is true (sharding avoids certain problems), it also restricts your ability to perform other operations (more complicated queries or reports become ~impossible). It is not without its tradeoffs.
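To make the "reports are ~impossible" point concrete: any aggregate that spans users can no longer run as a single SQL query; the application has to scatter the query to every shard and gather the partial results itself. A minimal sketch, using in-memory SQLite databases as stand-ins for shards (the table and column names are illustrative assumptions):

```python
import sqlite3

def total_revenue(shard_conns):
    """Cross-shard SUM: scatter the query to every shard, gather in app code."""
    total = 0
    for conn in shard_conns:
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders"
        ).fetchone()
        total += partial
    return total

# Simulate three shards, each holding a disjoint slice of the orders table:
shards = [sqlite3.connect(":memory:") for _ in range(3)]
for i, conn in enumerate(shards):
    conn.execute("CREATE TABLE orders (user_id INT, amount INT)")
    conn.execute("INSERT INTO orders VALUES (?, ?)", (i, 100))

print(total_revenue(shards))  # 300
```

This works for simple sums and counts; anything involving cross-shard JOINs, sorting with LIMIT, or multi-level grouping gets much harder, which is why such reporting usually moves to a separate analytics store.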
Sharded environments still have some JOINs. Typically all data for a single user/customer/whatever should be placed on a single shard, and it's still very useful to join across tables for a single user.
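The reason per-user JOINs keep working is that the routing function sends every row for a given user to the same shard. A minimal sketch of that idea (the shard count, hashing scheme, and schema are illustrative assumptions, not any particular system's implementation):

```python
import hashlib

NUM_SHARDS = 64

def shard_for(user_id: int) -> int:
    """Deterministically map a user_id to a shard.

    Because orders, addresses, etc. are all routed by user_id, every row
    for one user lands on the same shard.
    """
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Since user 42's rows are colocated, this JOIN runs entirely inside one
# shard's database -- no cross-shard coordination needed:
PER_USER_QUERY = """
SELECT o.id, a.city
FROM orders o
JOIN addresses a ON a.user_id = o.user_id
WHERE o.user_id = :user_id
"""
```

Only queries that cross user boundaries lose the ability to JOIN; shard-local queries keep the full feature set.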
> If you constrained the features on the non-sharded databases, you'd achieve the same net result.
No, that's not the main reason for operational benefits. Rather, it's simply because the shards all have a uniform schema, uniform query workload, and relatively small data size (as compared to a large monolithic DB). You can perform operational maintenance at better granularity -- i.e. on a single shard rather than the entire logical data set. And if a complex operation is fine on one shard, it's very likely fine on the rest, due to the uniformity.
For example, performing a DBMS major version upgrade on a large monolithic DB is a nightmare. It's an all-or-nothing affair. If you encounter some unforeseen problem with the new DB version only in prod, you can expect some significant application-wide downtime. Meanwhile for a sharded environment it's both easier and safer from an operational perspective, assuming your team is comfortable automating the process once it has been proven safe. You can upgrade one shard and confirm no problems after a week in prod before proceeding to the rest of the fleet. If unforeseen problems do occur, worst-case it only impacts a tiny portion of users.
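The canary-then-fleet flow described above can be sketched roughly as follows; the function names (`upgrade`, `healthy`) and the one-week bake period are assumptions for illustration, not a real automation API:

```python
import time

def rolling_upgrade(shards, upgrade, healthy, bake_seconds=7 * 24 * 3600,
                    sleep=time.sleep):
    """Upgrade one canary shard, let it bake in prod, then roll out the rest."""
    canary, rest = shards[0], shards[1:]
    upgrade(canary)
    sleep(bake_seconds)  # e.g. one week of real production traffic
    if not healthy(canary):
        # Worst case: only the canary's slice of users is affected.
        raise RuntimeError(f"canary shard {canary} unhealthy; halting rollout")
    for shard in rest:   # proven safe on the canary -> automate the rest
        upgrade(shard)
        if not healthy(shard):
            raise RuntimeError(f"shard {shard} unhealthy; halting rollout")

# Dry run with stand-in shards and checks:
upgraded = []
rolling_upgrade(["s0", "s1", "s2"], upgraded.append, lambda s: True,
                bake_seconds=0, sleep=lambda _: None)
print(upgraded)  # ['s0', 's1', 's2']
```

The key property is the blast radius: a failure at any step stops the rollout with only a fraction of users on the new version, whereas the monolithic upgrade is all-or-nothing by construction.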
> it also restricts your ability to perform other operations (more complicated queries or reports are ~impossible). It is not without its tradeoffs.
Yes, this is why I said above "There are definitely major downsides to sharding, but they tend to be more on the application side in my experience." OP claimed the downsides were operational (e.g. more complex failures or larger downtime), which I do not agree with.
Yes. Yoshinori is among the best of the best, and I could not hope to capture that level of nuance in a quick HN comment :)
But in terms of being a more balanced view, my read of his post aligns pretty closely with my main point in this subthread: the disadvantages of sharding fall more on the application side (limitations on joins, secondary indexes, transactions) than on the operational side (availability, performance and resource management, logical backup, db cloning, replication).