
What if you ingest 10s of millions of rows a day, and need to perform conditional updates based on those 10s of millions of rows?

Either you're going to do 10s of millions of fetches and a whole lot of code, or you can push some of the work to the database and perform updates that involve joins. Those queries may take minutes to complete, but the SQL will be pretty short.
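A minimal sketch of the set-based approach, using SQLite for portability; the table and column names are hypothetical, but the pattern is the same in MySQL or Postgres: one update driven by a staging table of ingested rows replaces millions of per-row fetch/update round trips.

```python
import sqlite3

# Hypothetical schema: "items" is the main table, "staging" holds a
# freshly ingested batch of price corrections.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, status TEXT, price REAL);
    CREATE TABLE staging (item_id INTEGER, new_price REAL);
""")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [(1, "open", 10.0), (2, "open", 20.0), (3, "open", 30.0)])
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(1, 11.0), (3, 33.0)])

# One short statement, pushed to the database: a correlated-subquery
# update conditioned on the staging batch, instead of per-row fetches.
conn.execute("""
    UPDATE items
    SET price = (SELECT new_price FROM staging
                 WHERE staging.item_id = items.id),
        status = 'repriced'
    WHERE id IN (SELECT item_id FROM staging)
""")

rows = conn.execute("SELECT id, status, price FROM items ORDER BY id").fetchall()
print(rows)
# [(1, 'repriced', 11.0), (2, 'open', 20.0), (3, 'repriced', 33.0)]
```

At scale the staging table would be bulk-loaded and the join columns indexed; the query may still run for minutes, but the SQL stays this short.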

Big queries that fetch millions of rows aren't solely the preserve of reporting. Some applications - business SaaS in particular - have a different user:row-count distribution from consumer businesses; instead of a handful of rows per consumer, with each consumer doing something simple, a business SaaS may have tens of users who each individually manipulate millions of rows every day.



What if you ingest 10s of millions of rows a day, and need to perform conditional updates based on those 10s of millions of rows?

Know your problem, then set things up appropriately.

I was addressing someone whose problem looked like a reporting query that was affecting the responsiveness of their consumer application. For that case, separate the transactional and reporting databases and tune each appropriately. That means, for example, different numbers of connections, different memory per connection, different amounts of temporary table space, and so on.

The problem that you're describing is much like the one I'm currently facing, ingesting time series from a busy factory floor, then doing various kinds of analytics. For that case, you can simply use one database, optimize reasonably, tune appropriately for the workload, and set appropriate expectations on responsiveness.

In all these cases you don't need a distributed system. Not unless your requirements are a lot harsher than what has been described so far.


My relational experience is largely in the analytics world. Can you give me a (simplified/obfuscated) example of this type of workload? Why do the rows need to be updated together or so frequently? My imagination is failing me but this sounds like an interesting problem.


Most recently I was chief architect at a London-based fintech SaaS company that does reconciliations. Recs, as they're known, are in practice a big diff, or join (in the relational sense), between two sets of tabular data.

The clients are financial organizations, and the records are things like trades, positions, transactions of various kinds. Larger organizations perform millions of these a day, and for regulatory and risk reasons need to have controls in place such that they are accounted for in different systems and with different counterparties (trade repositories, brokers and the like).

So the system looks like this: it takes in lots of records for one side, lots of records for the other side, and compares the two. This produces four classes of output: fully matched data; partially matched data (some fields in the "join", like ISIN, match, but others differ - e.g. the price or quantity or fee is incorrect); and unmatched data on the left and right sides.
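The comparison above is essentially a full outer join on the key fields, with a field-level diff on each joined pair. A toy sketch, with illustrative field names (the real system's schema is not shown here):

```python
# Two sides of a rec, keyed by ISIN; values are the comparable fields.
left = {
    "ISIN001": {"qty": 100, "price": 9.5},
    "ISIN002": {"qty": 50,  "price": 4.0},
    "ISIN003": {"qty": 75,  "price": 1.2},
}
right = {
    "ISIN001": {"qty": 100, "price": 9.5},   # full match
    "ISIN002": {"qty": 50,  "price": 4.1},   # partial: price differs
    "ISIN004": {"qty": 10,  "price": 2.0},   # right-side only
}

matched, partial, left_only, right_only = [], [], [], []
for key in left.keys() | right.keys():       # full outer join on the key
    if key not in right:
        left_only.append(key)
    elif key not in left:
        right_only.append(key)
    elif left[key] == right[key]:
        matched.append(key)
    else:
        # Partially matched: record which fields disagree.
        diffs = {f for f in left[key] if left[key][f] != right[key][f]}
        partial.append((key, diffs))

print(sorted(matched), sorted(partial), left_only, right_only)
# ['ISIN001'] [('ISIN002', {'price'})] ['ISIN003'] ['ISIN004']
```

In production this runs as database joins over millions of records per batch, not in application memory, but the four output classes are the same.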

All this data needs to be written into a database and fetchable and queryable and sortable on demand by end users, by any field in the data, or by metadata associated with the records. This in itself is a challenge, since you can't index everything and you especially can't index across joins.

But there's more. Unmatched data may get matched by subsequently submitted data, and partially matched data may be updated by refreshed submitted data. Workflow metadata needs to be associated with outstanding items (unmatched, partially matched), and this workflow information has further details, like labels, comments, assigned users and groups, which in turn may be updated manually or automatically, with configurable rule-based automatic workflow classification and assignment.

Millions of records a day added, across lots of customers, adding up to billions of rows a day (multiple rows per record, naturally). Maybe 5% of inserted records will get meaningfully updated later, but 5% of a big number is still pretty big.

Ideally recs would have a 100% match rate and not need any workflow metadata, because there would be no problems to assign to anyone, nor label or comment on. But match quality (often a function of customer maturity and sophistication) is variable, and more importantly, input batch skew is very common. That is, maybe an org gets a batch every evening from its counterparty, but submits an up-to-the-minute snapshot of its own data. If the skew is bad, every rec might have 30% unmatched data which will mostly be fully matched after the next batch comes in the next day. This overhang (called 'carry over' internally) needs to be managed: whether it gets too stale (timestamps, expiry), whether it in fact never gets matched off, whether it gets refreshed by new data, and so on.

Oh, and you can't stick everything denormalized into a document database and build nice fat indexes, because the bigger customers are hyper-cautious and don't want data intermingling, and running separate redundant services for every customer is too expensive.

It is an interesting problem - or at least it was, but it's not my problem any more!


Thanks for taking the time to explain that. Food for thought.


MySQL was the chosen store, picked primarily for its (async) replication story, which was better than Postgres's back in 2012. I'm not sure it wouldn't still be the better choice today, given the update load: I've had problems with slow bulk update speed in Postgres in personal toy (~20 million row) projects.

The last thing I did before I left was build out a mechanism for mirroring a denormalized copy of recent data in ClickHouse, partitioned by time (for matched) and type (for partials / unmatched). CH, being analytics oriented, works much better for the ad-hoc end user queries, which filter and sort on a handful of arbitrary columns - ideal for a columnar store. Interactive updates can be handled by CH's versioned merge tree table engine, and batch updates by rewriting the relevant partition from the source of truth in MySQL.
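The versioned merge idea is worth spelling out: ClickHouse's ReplacingMergeTree with a version column deduplicates rows sharing a key at merge time, keeping only the highest version. A minimal sketch of those semantics in plain Python (record keys and states are made up for illustration):

```python
# Rows as (key, version, state); later interactive updates re-insert the
# same key with a higher version rather than updating in place.
rows = [
    ("rec-1", 1, "unmatched"),
    ("rec-2", 1, "unmatched"),
    ("rec-1", 2, "matched"),   # interactive update: newer version wins
]

# What the merge effectively computes: max version per key survives.
latest = {}
for key, version, state in rows:
    if key not in latest or version > latest[key][0]:
        latest[key] = (version, state)

result = sorted((k, v[1]) for k, v in latest.items())
print(result)
# [('rec-1', 'matched'), ('rec-2', 'unmatched')]
```

Batch updates then bypass this path entirely: drop and rewrite the affected partition from the MySQL source of truth, which is cheap in a partitioned columnar store.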

I chose CH primarily because it scaled down well for a columnar store - much better than anything in the Hadoop space. The conservatism of the market meant you couldn't just throw random cloud provider data mechanisms at it, nor a whole lot of fancy big stuff.



