Hacker News

Actually, I'm curious. What are best practices for setting up redundancy at the load balancer level? Options I've seen are:

- Hot standby that detects when the master is down and takes over as master when needed

- DNS-level solutions that distribute across multiple load balancers

But DNS has a TTL that may not be honored by all ISPs, so how do you truly eliminate the single point of failure at the load balancer?



DNS-level solutions are oftentimes misunderstood, probably because there are a couple of things that can be done at this level:

1. "Geo-DNS" is about using an anycast network to direct users to their nearest datacenter(s). This does _not_ aid High Availability at all.

2. DNS Round Robin is about distributing the load between multiple IPs. As a load balancing solution it is relatively poor, because you have no control over the actual balancing and can end up receiving most users through a single IP.

3. DNS Failover solutions that replace the IP when the server goes down, which is also a poor solution because of TTL and non-TTL browser caches.

4. DNS Round Robin but for the High Availability, not for the balancing. This is actually an interesting approach because most modern browsers automatically switch to using (one of) the other record(s) when the IP they were using goes down (sorry, I have no reference that clearly states which browsers do and which ones don't exhibit this behavior). In fact, there are some sources around [1] that seem to identify this approach as the only one to achieve instant failover in the face of datacenter-wide outages.

[1] http://serverfault.com/questions/69870/multiple-data-centers...
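The browser behavior described in #4 can be sketched as a client that tries each A record in turn with a connect timeout. A minimal sketch, assuming an injectable `probe` hook (hypothetical, so the fallback logic can be exercised without real sockets):

```python
import socket

def first_reachable(addrs, port=80, timeout=5.0, probe=None):
    """Return the first address that accepts a TCP connection.

    Tries each A record in order, the way many browsers fall back
    to the next record when a connect attempt fails or times out.
    `probe` is an injectable check; by default it does a real
    TCP connect with the given timeout.
    """
    if probe is None:
        def probe(addr):
            try:
                socket.create_connection((addr, port), timeout=timeout).close()
                return True
            except OSError:
                return False
    for addr in addrs:
        if probe(addr):
            return addr
    return None
```

Note that with real sockets, an IP that silently drops packets costs the full `timeout` before the next record is tried, rather than failing fast the way an ICMP error would.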


I have experience using approach #4 for global high availability, and the main downside is that most network outages result in packet loss rather than returning an immediate ICMP error, so browsers will hang for about 30 seconds before timing out and trying the next A record.

My personal preference is to combine approaches #3 and #4: have all hosts in the round robin with a lowish TTL, but automatically remove any host that goes down.
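Combining #3 and #4 amounts to serving a round-robin record set and pruning members that fail health checks. A minimal sketch of the record-selection logic (the `check` callable is a hypothetical health probe):

```python
def healthy_records(hosts, check):
    """Return the A records to serve: every host passing a health check.

    Keeps all healthy hosts in the round robin (#4) and drops any
    host that is down (#3). The result should be served with a low
    TTL so resolvers pick up removals quickly.
    """
    alive = [h for h in hosts if check(h)]
    # Never serve an empty record set: if every check fails, the
    # checks themselves are more likely broken than every host.
    return alive if alive else list(hosts)
```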


1. If you can pull your routes, it sure does.


I'm using two HAProxy machines and keepalived to ensure that if one goes down, the other takes over.

Because HTTP in general, and HAProxy with it, is largely stateless, this is really easy and safe to do. I've had the opportunity to fail over multiple times already (mainly for software updates, once because of a hardware issue) and I've never had any problems with the setup.
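For reference, the keepalived side of such a setup is a short VRRP stanza; a minimal sketch (the interface name, router ID, priorities, and virtual IP are placeholders):

```
vrrp_instance VI_1 {
    state MASTER          # the peer machine uses state BACKUP
    interface eth0
    virtual_router_id 51  # must match on both machines
    priority 101          # the backup uses a lower value, e.g. 100
    advert_int 1
    virtual_ipaddress {
        203.0.113.10/24   # the floating VIP that HAProxy binds to
    }
}
```

When the MASTER stops sending VRRP advertisements, the BACKUP claims the virtual IP, so clients keep hitting the same address.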


This sounds like a split-brain disaster waiting to happen, when a network partition between the two load balancers makes both think that the other one is down.


This is why you only ever automatically fail over in one direction, and (ideally) can STONITH.


I was running a similar system. What solution would you propose?


I'm no expert, but I think the main idea is to have an odd number of nodes plus a consensus protocol like Paxos or Raft to figure out who's actually alive. If a node finds itself in a split minority, it won't become a master. This doesn't work when you have only two nodes: after the split there's no minority, both sides are equal. Building this from scratch is extremely difficult and error-prone, so people usually use existing systems like ZooKeeper.
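The majority rule is simple to state: a node may act as master only if it can see a strict majority of the cluster, itself included. A sketch of just that rule (not a real consensus implementation, which also involves terms, logs, and persistent state):

```python
def may_lead(visible_nodes, cluster_size):
    """True if this node sees a strict majority and may become master.

    `visible_nodes` counts the node itself plus every peer it can
    currently reach. In an odd-sized cluster, at most one side of any
    partition can hold a majority; with two nodes, a 1/1 split leaves
    neither side eligible, which is the problem described above.
    """
    return visible_nodes > cluster_size // 2
```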


Your first option. A VIP with a heartbeat failover setup. DNS is useful when you have multiple endpoints and need to push more bandwidth than a single LB/box can handle. But for anything under ~5 Gbit/sec, the active/hot-standby setup works great.


Out of interest, how would you do this on a VPS like Digital Ocean, where IPs cannot be reassigned?


You'd need some place that allows for VIPs I think... I've never used Digital Ocean. Sorry.


The hot-standby approach, using VRRP to manage a floating address between the two boxes, does work and fails over quickly (unnoticeably to most clients), but getting it set up correctly is a project.


For a database, it would be not having a load balancer at all and performing connection pooling / failover at the client.
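As a concrete example of client-side failover, PostgreSQL's libpq (version 10 and later) accepts several hosts in the connection string and tries them in order; with `target_session_attrs=read-write` it skips replicas that cannot accept writes. The host names here are placeholders:

```
postgresql://app@db1.example.com:5432,db2.example.com:5432/mydb?target_session_attrs=read-write
```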



