
It's a huge pet peeve of mine, but why do people present network maps of these huge distributed things featuring ONE SINGLE load balancer in front of it all!?


Actually, I'm curious. What are best practices for setting up redundancy at the load balancer level? Options I've seen are:

- Hot standby that detects when the master is down and takes over as master when needed

- DNS-level solutions that distribute across multiple load balancers

But DNS has a TTL that may not be honored by all ISPs, so how do you truly eliminate the load balancer as a single point of failure?


DNS-level solutions are oftentimes misunderstood, probably because there are a couple of things that can be done at this level:

1. "Geo-DNS" is about directing users to their nearest datacenter(s), typically via geolocation-aware answers or an anycast network. This does _not_ aid High Availability at all.

2. DNS Round Robin is about distributing the load between multiple IPs. As a load balancing solution it is relatively poor, because you have no control over the actual balancing and can end up receiving most users through a single IP.

3. DNS Failover solutions that replace the IP when the server goes down, which is also a poor solution because of TTL and non-TTL browser caches.

4. DNS Round Robin, but for the High Availability, not for the balancing. This is actually an interesting approach because most modern browsers automatically switch to using (one of) the other record(s) when the IP they were using goes down (sorry, I have no reference that clearly states which browsers do and which ones don't exhibit this behavior). In fact, there are some sources around [1] that seem to identify this approach as the only one to achieve instant failover in the face of datacenter-wide outages.

[1] http://serverfault.com/questions/69870/multiple-data-centers...
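The fallback behavior described in #4 can be approximated client-side. A rough Python sketch of the idea (the function name and the "try each record in order" policy are illustrative; real browsers implement this internally, with their own timeouts):

```python
import socket

def connect_first_alive(addrs, port, timeout=3.0):
    """Try each A record in turn, like a browser falling back to the
    next record when the IP it was using is unreachable."""
    for ip in addrs:
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError:
            continue  # dead or unreachable IP: fall through to the next record
    raise OSError("all records failed")
```

The timeout parameter matters: as noted downthread, an outage that blackholes packets (instead of returning an ICMP error) costs you the full timeout per dead record before the fallback kicks in.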


I have experience using approach #4 for global high availability, and the main downside is that most network outages result in packet loss rather than returning an immediate ICMP error, so browsers will hang for about 30 seconds before timing out and trying the next A record.

My personal preference is to combine approaches #3 and #4: have all hosts in the round robin with a lowish TTL, but automatically remove any host that goes down.
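A minimal sketch of that #3 + #4 combination (function names are made up): run periodic TCP health checks and publish only the hosts that pass as the round-robin record set, with a low TTL.

```python
import socket

def alive(host, port, timeout=1.0):
    """TCP connect check: treat a completed handshake as healthy."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def record_set(hosts, port, timeout=1.0):
    """Keep every healthy host in the round robin; drop the rest.
    The result would be pushed to DNS as the current A records."""
    return [h for h in hosts if alive(h, port, timeout)]
```

Clients that honor the TTL stop seeing a dead host quickly; clients that don't still get the browser-level fallback from approach #4.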


1. If you can pull your routes, it sure does.


I'm using two HAProxy machines and keepalived to ensure that if one goes down, the other takes over.

Because HTTP is largely stateless, and with it HAProxy, this is really easy and safe to do. I've had the opportunity to fail over multiple times already (mainly for software updates, once because of a hardware issue) and I never had any problems with the setup.
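For reference, a stripped-down keepalived VRRP block for this kind of pair might look like the following (interface name, router id, password, and VIP are placeholders); the standby runs the same block with state BACKUP and a lower priority:

```
vrrp_instance LB_VIP {
    state MASTER          # the peer uses BACKUP
    interface eth0        # placeholder NIC
    virtual_router_id 51
    priority 150          # peer gets a lower value, e.g. 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cret  # placeholder
    }
    virtual_ipaddress {
        192.0.2.10/24     # the floating VIP clients connect to
    }
}
```

When the master stops sending VRRP advertisements, the backup claims the VIP and traffic follows it.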


This sounds like a split-brain disaster waiting to happen, when a network partition between the two load balancers makes both think that the other one is down.


This is why you only ever automatically fail over in one direction, and (ideally) can STONITH.


I was running a similar system. What solution would you propose?


I'm no expert, but I think the main idea is to have an odd number of nodes plus a consensus protocol like Paxos or Raft to figure out who's actually alive. If a node finds itself in a split minority, it won't become a master. This doesn't work when you have only two nodes: after a split there is no minority, since both sides are equal. Building this from scratch is extremely difficult and error-prone, so people usually use systems like ZooKeeper.
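The quorum rule itself is just a strict-majority test; a toy sketch (not remotely a substitute for ZooKeeper or a real Raft implementation, and the function name is made up):

```python
def may_lead(reachable_peers, cluster_size):
    """A node may (try to) become master only if it can see a strict
    majority of the cluster, counting itself."""
    return reachable_peers + 1 > cluster_size // 2

# 3-node cluster split 2|1: the pair keeps quorum, the loner does not.
# 2-node cluster split 1|1: neither side has a strict majority, so
# neither becomes master -- no split-brain, but no availability either.
```

This illustrates why two nodes alone can't safely decide: the majority rule prevents split-brain only by leaving both sides unable to lead.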


Your first option. A VIP with a heartbeat failover setup. DNS is useful when you have multiple endpoints and need to push more bandwidth than a single LB/box can handle. But for anything under ~5 Gbit/sec, the Active-Hot Standby setup works great.


Out of interest, how would you do this on a VPS like Digital Ocean, where IPs cannot be reassigned?


You'd need some place that allows for VIPs I think... I've never used Digital Ocean. Sorry.


The hot-standby approach, using VRRP to manage a floating address between the two boxes, does work and fails over quickly (unnoticeably to most clients), but getting it set up correctly is a project.


For a database it would be not having a load balancer at all and performing connection pooling / failover at the client.


Because load balancers in a diagram are generally a high availability pool of load balancers.


Because everyone loves a Single Point of Failure? /s

It is probably because they don't provide the failover themselves and assume you have a standby LB and an active LB. So only one needs to be shown, and you handle the failover on the application side rather than on their side. At least, that is the context in which I generally see that sort of thing.


Because they are not good at diagrams? Or they're just emphasizing the key components of the architecture.


This issue is discussed (not by me) a bit more in depth here: https://plus.google.com/+KristianK%C3%B6hntopp/posts/HtQB6hJ...


Because while the load balancer shown might (and often, in practice, would) actually be a multi-component system of its own, that's a concern outside the focus of the diagram that distracts from what it is attempting to illustrate.



