If it is just a backend, why not port it over to one of the myriad of cloud autoscaling solutions that are out there?
Weighing the opportunity cost of figuring out why only 29 workers are receiving requests against adding new features that generate more revenue seems like a quick decision.
Personally, I just start off that way now. The development load isn't any greater, and the solutions out there are quite good.
Author here. We do and did use autoscaling heavily, but at a certain scale we just ran out of headroom on the smaller instance types we were using. Jumping to much larger instance types meant we would likely never run into those headroom issues again, plus it solved other problems: faster spin-up, better sidecar connection pooling, and a much higher hit rate on per-instance caching.
You were autoscaling a single threaded process. You had 1000 connections coming in and scaling 1000 workers for those connections. Everything was filtered through gunicorn and nginx, which just adds additional latencies and complexity, for no real benefit.
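For what it's worth, gunicorn itself can move past the one-connection-per-process model without any external autoscaler. A sketch of a threaded configuration (`workers`, `worker_class`, and `threads` are real gunicorn settings; the specific numbers are just a common rule of thumb, not a recommendation from the article):

```python
# gunicorn.conf.py -- sketch only. With the default "sync" worker class,
# concurrency equals the worker count, so 1000 connections need 1000
# processes. The "gthread" class multiplexes several requests per process.
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1  # common sizing heuristic
worker_class = "gthread"
threads = 4  # each worker process now serves 4 concurrent requests
```

With this config, total concurrency is `workers * threads` rather than just `workers`, which is the usual first step before reaching for bigger instances.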
What I'm talking about is just pointing at something like AppEngine, Cloud Functions, etc... (or whatever solution AWS has that is similar) and being done with it. I'm talking about not running your own infrastructure, at all. Let AWS and Google be your devops so that you can focus on building features.
According to the article they have a monolithic Django application, so it will have at least a couple of seconds of start-up time. That is not a good match for Cloud Functions.
Django also has in-memory caches, for example for templates which can be extremely slow (seconds) and CPU intensive to render. So you really don't want to have AWS or Google restart your application on AppEngine whenever they feel like it.
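For anyone hitting the template cost mentioned above: Django ships a cached template loader so each process parses and compiles a template once, then serves it from memory. A trimmed-down `settings.py` fragment (this is Django's documented configuration, not something specific to the article):

```python
# settings.py fragment: wrap the normal loaders in the cached loader so
# template parsing happens once per process instead of once per request.
TEMPLATES = [{
    "BACKEND": "django.template.backends.django.DjangoTemplates",
    "DIRS": [],
    "OPTIONS": {
        "loaders": [
            ("django.template.loaders.cached.Loader", [
                "django.template.loaders.filesystem.Loader",
                "django.template.loaders.app_directories.Loader",
            ]),
        ],
    },
}]
```

This is exactly the in-memory state that makes frequent restarts (as on AppEngine) expensive: every fresh process pays the parse cost again.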
There are a few reasons why this scenario wouldn't be a good fit for cloud functions, but that "couple of seconds start-up time" can be almost entirely removed from the equation by keeping the Django instance alive (all cloud-function-type offerings have a concept of cold and warm starts, and some way to persist state across calls on the same "instance").
I've run Django on AWS Lambda in a scenario that scaled between 25-250 calls per second depending on time of day (for a runtime of 5-30 sec). Moving Django's bootstrapping out of the handler so it stayed warm across calls was very easy.
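The warm-across-calls trick is essentially just module-level initialization: Lambda reuses the module state between invocations on the same container. A minimal sketch of the pattern (`heavy_bootstrap` is a made-up stand-in for Django's setup work, not a real Django or Lambda API):

```python
# Sketch of the Lambda warm-start pattern: expensive setup runs at module
# import time, once per cold start, and is reused across warm invocations.
import time

def heavy_bootstrap():
    # Stand-in for django.setup(): loading settings, apps, template caches.
    time.sleep(0)  # placeholder for the seconds of real bootstrap work
    return {"ready": True}

# Executed once when the container cold-starts, then kept in memory.
APP_STATE = heavy_bootstrap()

def handler(event, context):
    # Warm invocations skip the bootstrap entirely and just serve traffic.
    return {"statusCode": 200, "warm": APP_STATE["ready"]}
```

Anything constructed at module scope (settings, connection pools, compiled templates) survives between calls, so only the first request on a new container pays the start-up cost.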
> unless you have unfixable memory leaks there is no reason to do this.
It's also useful to set this threshold to prevent long-lived connections to services/datastores not used by every request from accumulating and consuming resources on those services.
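For reference, gunicorn exposes exactly this recycling threshold. A sketch of the relevant config (`max_requests` and `max_requests_jitter` are real gunicorn settings; the numbers are illustrative):

```python
# gunicorn.conf.py -- recycle each worker after a bounded number of
# requests. This caps slow memory leaks and drops stale connections to
# backends that only some requests touch.
max_requests = 1000        # restart a worker after ~1000 requests
max_requests_jitter = 100  # randomize the threshold per worker so they
                           # don't all restart in lockstep
```

The jitter matters in practice: without it, workers started together hit the limit together and restart as a thundering herd.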
a) you get to fire the devops person, which saves $150k+ a year.
b) you add appropriate caching layers in front of everything.
c) you spend time adding features, which generate revenue.
I've done all of this before at scale. This whole case study was written about work I did [1]. Two devs, 3 months to release, first year was $80m gross revenue on $500/month cloud bills. Infinite scalability, zero devops.
> you get to fire the devops person, which saves $150k+ a year.
You are deluded or extremely short-sighted if you believe you can actually fire the devops guy. In my experience, the further you stray from the conventional "dedicated server" paradigm, the more you need a devops guy, and you are in a very precarious position if you do fire him and something goes wrong.
You don't hire the devops person until you've scaled to the point that you need one.
Additionally, the thought of having my company held hostage by a single devops person is terrifying. Now you need two of them, which is even more expensive.
It is a great way to bootstrap a company: you save a salary (or two) that can honestly be engineered out of a lot of SaaS businesses. It worked super well for us... and calling someone who did $80m in the first year deluded seems, well, rude.
But if you start off designing systems that scale on their own, you are much better prepared for fast growth than if you have to hire a good devops person (which is extremely hard; as they say, all the good ones are taken).
At the end of the day, the actual elephant in the room is that Django was the wrong choice. You end up having to go through a lot of contortions to make things work, as evidenced by the blog post. The architecture doesn't make things easy to spin up quickly, which creates a lot of bottlenecks. There are better cloud-based solutions.
If you don’t have a devops person, then you end up with developers pitching in to fill that void. That’s OK and may be desirable but it is still a cost.
They are on a back-end that does auto-scaling. They stated that they had problems when scaling up past 1000 nodes.
Now, maybe they could have fixed that issue instead, but going from 29 to 58 workers is easy; it's not the same as going from 29,000 to 58,000. And 1,000 hosts vs. 500 is a non-trivial cost.
You'd probably be worrying more about instance sizes if you ran a single executor per container; the memory overhead of your app would become a problem very quickly unless its startup footprint was quite small.
This doesn’t work so easily with architectures with process pools for workers. So now your app server needs to speak docker (or whatever control plane) to spawn new workers and deal with more complicated IPC. Also the startup time is brutal.
One process per container plus multiprocessing is a huge lift most of the time. I've done it, but it can be a mess because you don't have as much of a handle on containers as you do on subprocesses: you can only poke them at a distance through the control plane.
Do you mean multiprocessing inside the containers? Or are you managing multiprocessing child procs by forking into a container somehow? If the latter, I'd be really interested to learn how to do that; I didn't think it was possible, and it would be super useful for some of what I work on.