As far as failover is concerned, we want to make absolutely sure that the impact of our downtime on our customers’ websites is minimal (ideally nil). We can afford to be down, but we cannot afford for our customers’ websites to be down because of us. Our uptime has been around 99.8% (see our public uptime reports). We are specifically interested in finding out the worst that can possibly happen and in minimizing it. Our research led us to conclude that even for the best configurations in the world (yes, that includes Google, Amazon and all the biggies), the maximum theoretical downtime (for a few users) will be 30 minutes. You cannot possibly escape this limitation.
In high-availability systems, failover is typically implemented with a technique called IP takeover using heartbeat. Essentially, two or more servers share the same public IP address and monitor each other continuously. If one goes down, one of the other servers takes over the IP address and begins serving requests. To the end user, all this is transparent, and the failure of one machine doesn’t really affect the service. The key point is that for the IP takeover mechanism to work, all servers must be in the same data center (and perhaps on the same network switch?). That is, this strategy works only if the backup servers are physically close to each other.
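To make the heartbeat idea concrete, here is a minimal sketch in Python. It only simulates the decision of who should hold the shared IP address; it does not touch real network interfaces. The names `Node` and `choose_ip_owner`, the priority ordering, and the 3-second timeout are all assumptions made for illustration, not how any particular heartbeat tool works.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float = 0.0  # time (seconds) of the last heartbeat seen from this node


def choose_ip_owner(nodes, now, timeout=3.0):
    """Return the name of the first node, in priority order, whose heartbeat is
    still fresh. Returns None if every node has been silent for too long."""
    for node in nodes:
        if now - node.last_heartbeat <= timeout:
            return node.name
    return None


# Both servers sent a heartbeat at t=10, so the primary keeps the shared IP.
nodes = [Node("primary", last_heartbeat=10.0), Node("standby", last_heartbeat=10.0)]
owner_healthy = choose_ip_owner(nodes, now=11.0)

# The primary falls silent; the standby keeps sending heartbeats and takes over.
nodes[1].last_heartbeat = 20.0
owner_after_failure = choose_ip_owner(nodes, now=20.5)

print(owner_healthy, owner_after_failure)
```

In a real setup, the server that wins this decision would then claim the shared address on the local network, which is why the servers need to sit close together.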
However, suppose that your data center suffers a major network or power outage (uncommon, but it can still happen). IP takeover isn’t going to help in this case, because all the servers go down together (being in the same data center). To guard against such a scenario, a technique called DNS failover [pdf link] is used. Whenever a user requests your website (www.example.com), the browser asks DNS servers which IP address corresponds to that hostname. That is, the browser contacts DNS servers to resolve www.example.com to the IP address of the main server. With DNS failover, your website is continuously monitored, and if it is detected to be down, the DNS servers are automatically updated to point www.example.com to a backup server (located in a geographically separate data center). So all requests for www.example.com that earlier resolved to your main server start resolving to your backup server. This means that if you suffer a major outage in your main data center, you can start serving requests from your backup server in a different data center.
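The monitoring-and-update loop described above can be sketched roughly as follows. This is a toy model, not a real DNS provider integration: the zone is a plain dictionary, the health check result is passed in as a boolean, and the function name `dns_failover_step` and both IP addresses (taken from the ranges reserved for documentation) are hypothetical.

```python
PRIMARY_IP = "192.0.2.10"    # hypothetical main server
BACKUP_IP = "198.51.100.10"  # hypothetical backup server in another data center


def dns_failover_step(zone, hostname, primary_is_up):
    """Run one monitoring cycle: point the hostname at the primary server while
    it is healthy, and at the backup server otherwise. Returns the active IP."""
    zone[hostname] = PRIMARY_IP if primary_is_up else BACKUP_IP
    return zone[hostname]


zone = {"www.example.com": PRIMARY_IP}

# Normal operation: the health check passes, so nothing changes.
ip_when_up = dns_failover_step(zone, "www.example.com", primary_is_up=True)

# The main data center goes down: the record is switched to the backup server.
ip_when_down = dns_failover_step(zone, "www.example.com", primary_is_up=False)

print(ip_when_up, ip_when_down)
```

A production monitor would probably require several consecutive failed checks before switching, to avoid flapping on a single dropped request; that detail is left out here.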
However, there is an important caveat. DNS entries have a Time-To-Live (TTL) value attached to them, which tells the browser how long it should cache the response to a DNS resolution query. That is, if your DNS entry’s TTL is 20 minutes, the browser will make one DNS lookup for your website and cache the result so that it doesn’t have to look it up again for another 20 minutes. The downside is that if your server goes down within those 20 minutes, the browser will not see the updated (backup server) IP address until the cached entry expires, and until then those visitors will see your website as down. (New visitors will see the backup server, as they will make a fresh DNS resolution query.)
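The caching effect is easy to see in a small simulation. The sketch below assumes a client that honors the TTL exactly; the class name `CachingResolver`, the IP addresses, and the timings are made up for illustration.

```python
class CachingResolver:
    """A client-side DNS cache that honors the record's TTL."""

    def __init__(self, zone, ttl_seconds):
        self.zone = zone        # shared "DNS server" state: hostname -> IP
        self.ttl = ttl_seconds
        self.cache = {}         # hostname -> (ip, expires_at)

    def resolve(self, hostname, now):
        cached = self.cache.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]    # entry still valid: no fresh lookup is made
        ip = self.zone[hostname]
        self.cache[hostname] = (ip, now + self.ttl)
        return ip


zone = {"www.example.com": "192.0.2.10"}       # hypothetical main server
existing_visitor = CachingResolver(zone, ttl_seconds=20 * 60)
first_lookup = existing_visitor.resolve("www.example.com", now=0)

# Five minutes in, the main server fails and DNS is switched to the backup.
zone["www.example.com"] = "198.51.100.10"      # hypothetical backup server

stale_lookup = existing_visitor.resolve("www.example.com", now=10 * 60)
fresh_lookup = existing_visitor.resolve("www.example.com", now=20 * 60)

# A new visitor has nothing cached and sees the backup server immediately.
new_visitor = CachingResolver(zone, ttl_seconds=20 * 60)
new_visitor_lookup = new_visitor.resolve("www.example.com", now=10 * 60)

print(first_lookup, stale_lookup, fresh_lookup, new_visitor_lookup)
```

The existing visitor keeps getting the dead server’s address until the 20-minute mark, while the new visitor is unaffected.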
This can, of course, be mitigated by using a low TTL value. At Visual Website Optimizer, we use a TTL of 2 minutes, so that in case of downtime cached DNS entries expire within 2 minutes of the update and clients pick up the backup server. That is, within about 2 minutes Visual Website Optimizer will appear to be up again, even if there is a major outage. If the typical response time is 2 minutes, why do we say the theoretical downtime is 30 minutes? That’s because Internet Explorer caches DNS results for 30 minutes irrespective of the TTL value. This means that no matter how low your TTL value is, Internet Explorer users who are using your website at the time it goes down can see it as down for up to 30 minutes. Note that all new users (including those on IE) will do a fresh DNS query and see your website as up. For more details on IE’s atypical behavior read this PDF document.
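The arithmetic behind the 30-minute figure can be written down as a short sketch. It assumes a client either honors the TTL or enforces its own minimum cache time; the function name `worst_case_stale_minutes` is hypothetical, and the 30-minute value is simply the Internet Explorer figure quoted above, not something measured here.

```python
def worst_case_stale_minutes(ttl_minutes, client_min_cache_minutes=0):
    """Longest time a client can keep using an outdated DNS answer: the TTL,
    or the client's own minimum cache time if that is longer."""
    return max(ttl_minutes, client_min_cache_minutes)


TTL_MINUTES = 2  # the TTL used in this post

# A client that respects the TTL recovers within 2 minutes.
ttl_respecting = worst_case_stale_minutes(TTL_MINUTES)

# A client that caches for 30 minutes regardless of TTL (IE, per this post).
ie_like = worst_case_stale_minutes(TTL_MINUTES, client_min_cache_minutes=30)

print(ttl_respecting, ie_like)  # prints: 2 30
```

Lowering the TTL further changes the first number but not the second, which is why the worst case stays at 30 minutes.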
If you are a technical person reading this and think we are misguided in our research, please leave a comment or email us at firstname.lastname@example.org. If you have a better solution for reducing the impact of downtime, we would of course love to be proven wrong 🙂