
Our two main priorities for Visual Website Optimizer are: a) the extra load we put on customers' websites; and b) the impact on customers' websites if our servers go down. These issues are super-important to us because, unlike most other web applications, our design decisions affect not just our own website but hundreds of other (customer) websites that currently include our JavaScript for running A/B split tests.

As far as speed is concerned, we recently ran a series of benchmarks to find the bottlenecks. It turned out that the slowest part of the process is the download of our 28 KB (gzipped) JavaScript file. The file was hosted on Amazon S3 (because of this), and we have now decided to switch to a CDN (Content Delivery Network) for serving it. Our benchmarks show Amazon CloudFront is about 10 times faster than S3. The reason our JavaScript file is 28 KB is that it bundles jQuery, which is needed for switching page content during an A/B test (for technical folks: it is needed for interpreting CSS selectors). We recently experimented with the cssQuery library (which is much lighter), and it brought the file down from 28 KB to 8 KB. However, that was just an experiment. Going forward, we plan to spend more energy on trimming the file size.
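
For the technically curious, here is a minimal, hypothetical sketch of what "switching page content" means in practice. It uses the browser-native querySelectorAll in place of jQuery or cssQuery, and the selector and replacement markup are made up for illustration:

    // Illustrative sketch only, not VWO's actual snippet. It shows why a CSS
    // selector engine (jQuery, cssQuery or the now-standard querySelectorAll)
    // is needed: a variation is essentially "replace whatever matches this
    // selector", and the library resolves that selector to DOM nodes.
    var variation = {
      selector: '#headline',  // hypothetical element targeted by the test
      html: 'New headline being tested in the variation'
    };

    var nodes = document.querySelectorAll(variation.selector);
    for (var i = 0; i < nodes.length; i++) {
      nodes[i].innerHTML = variation.html;  // swap control content for the variation
    }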

As far as failover is concerned, we want to make absolutely sure that the impact of our downtime on our customers' websites is minimal (ideally nil). We can afford to be down, but we cannot afford our customers' websites to be down because of us. Our uptime has been around 99.8% (see our public uptime reports). We are specifically interested in finding out the worst that can possibly happen and minimizing it. Our research led us to conclude that even for the best configurations in the world (yes, that includes Google, Amazon and all the biggies), the maximum theoretical downtime (for a few users) is 30 minutes. You cannot possibly escape this limitation.

In high-availability systems, failover is typically implemented with a mechanism called IP takeover, using heartbeat. Essentially, this means that two or more servers share the same public IP address and monitor each other continuously. If one goes down, another server takes over the IP address and begins serving requests. To the end user all of this is transparent, and the failure of one machine doesn't really affect the service. The one key point here is that for the IP takeover mechanism to work, all servers must be in the same data center (and perhaps on the same network switch?). That is, this strategy works only if the backup servers are physically close to each other.
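
As a rough illustration (this is not how our servers are configured; real deployments use tools like Heartbeat or keepalived rather than hand-rolled scripts), the backup node's side of IP takeover boils down to something like the following Node.js sketch. The peer address and the takeover-ip.sh helper are hypothetical:

    // Rough sketch of the backup node's role in IP takeover: watch the primary,
    // and if it stops responding, claim the shared IP and announce it on the
    // local network.
    const http = require('http');
    const { execFile } = require('child_process');

    const PRIMARY = { host: '10.0.0.1', port: 80, path: '/health' }; // assumed peer address
    let failures = 0;

    setInterval(() => {
      const req = http.get(PRIMARY, (res) => {
        failures = 0;  // primary answered, reset the failure count
        res.resume();  // drain the response so the socket is freed
      });
      req.setTimeout(2000, () => req.destroy());
      req.on('error', () => {
        failures += 1;
        if (failures === 3) {
          // Hypothetical helper: adds the shared public IP to this machine's
          // interface and sends a gratuitous ARP so the router learns the new MAC.
          execFile('/usr/local/bin/takeover-ip.sh', (err) => {
            if (err) console.error('takeover failed:', err);
          });
        }
      });
    }, 5000);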

However, suppose your data center suffers a major network and power outage (uncommon, but it can still happen). IP takeover isn't going to help in this case, because all the servers go down together (being in the same data center). To handle such a scenario, a technique called DNS failover [PDF link] is used. Whenever a user requests your website (www.example.com), the browser asks DNS servers which IP address corresponds to that URL. That is, the browser contacts DNS servers to resolve www.example.com to the IP address of the main server. In DNS failover, your website is continuously monitored, and if it is detected to be down, the DNS records are automatically updated to point www.example.com to a backup server (located in a geographically separate data center). So all requests for www.example.com, which earlier resolved to your main server, start resolving to your backup server. This means that if you suffer a major outage in your main data center, you can start serving requests from your backup server in a different data center.
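
Conceptually, every visitor's browser performs the equivalent of the lookup below (shown here with Node's dns module, with www.example.com standing in for your domain). DNS failover works because the monitoring service rewrites the A record, so the same lookup starts returning the backup server's address:

    // Sketch: what "resolving www.example.com" means. Before failover this
    // lookup returns the primary server's IP; after the monitoring service
    // updates the A record (and caches expire), the same call returns the
    // backup server's IP.
    const dns = require('dns');

    dns.resolve4('www.example.com', (err, addresses) => {
      if (err) throw err;
      console.log('www.example.com currently resolves to:', addresses);
    });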

However, there is an important caveat. DNS entries have a Time-To-Live (TTL) value attached to them, which tells the browser how long it may cache the response to a DNS resolution query. That is, if your DNS entry's TTL is 20 minutes, the browser will make one DNS lookup for your website and cache the result, so it doesn't have to look it up again for another 20 minutes. The downside is that if your server goes down within those 20 minutes, the browser will not see the updated (backup server) IP address for the full 20 minutes, and those visitors will see your servers as down. (New visitors will see the backup server, since they make a fresh DNS resolution query.)
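
You can inspect the caching window yourself by reading the TTL that comes back with the A record; here is a small Node.js sketch with a placeholder domain (running dig www.example.com A on the command line shows the same information):

    // Sketch: reading the TTL attached to an A record. The TTL is the number of
    // seconds a resolver or browser may cache this answer, i.e. the window
    // during which an already-cached visitor keeps hitting the old (dead) IP
    // after a DNS failover.
    const dns = require('dns');

    dns.resolve4('www.example.com', { ttl: true }, (err, records) => {
      if (err) throw err;
      records.forEach((r) => {
        console.log(r.address + ' cacheable for ' + r.ttl + ' seconds');
      });
    });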

This can, of course, be mitigated by using a low TTL value. At Visual Website Optimizer, we use a TTL value of 2 minutes, so that in case of downtime DNS is updated within 2 minutes to point at the backup server. That is, within 2 minutes Visual Website Optimizer will appear to be up again, even during a major outage. If the typical failover time is 2 minutes, why do we say the maximum theoretical downtime is 30 minutes? That's because Internet Explorer caches DNS for 30 minutes irrespective of the TTL value. This means that no matter how low your TTL value is, Internet Explorer users who are on your website at the time it goes down will see the servers as down for up to 30 minutes. Note that all new users (including IE users) will do a fresh DNS query and see your website as up. For more details on IE's atypical behavior, read this PDF document.

If you are a Visual Website Optimizer user (or a user of any other service), you now know the maximum theoretical downtime you can expect. For VWO users, because it is the JavaScript on your pages that contacts our servers, your website will not go down but rather slow down if our servers go down. Note that this is an extreme case: our uptime is already around 99.8%, which means you may never experience any such slowdown at all. But we believe educating customers about the worst-case scenario is beneficial. Also note that you always have the option of self-hosting the JavaScript files, thereby eliminating all synchronous contact with our servers and avoiding any slowdown in those rare cases.
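
For completeness, here is a generic async-loader sketch (not our official embed code; the file path is a made-up self-hosted copy of the snippet). Because the script is fetched without blocking parsing, a slow or unreachable server delays the experiment rather than the page itself:

    // Generic async script-loader pattern with a hypothetical file path. The
    // injected script element has async set, so the browser keeps parsing and
    // rendering the page while the A/B testing snippet downloads in the
    // background.
    (function () {
      var s = document.createElement('script');
      s.async = true;
      s.src = '/static/js/ab-test-snippet.js';  // self-hosted copy instead of a vendor URL
      var first = document.getElementsByTagName('script')[0];
      first.parentNode.insertBefore(s, first);
    })();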

If you are a technical person reading this and think we are misguided in our research, please leave a comment or email us at info@wingify.com. If you have a better solution for reducing downtime impact, we would, of course, love to be proven wrong 🙂

Author

CEO of @Wingify by the day, startups, marketing and analytics enthusiast by the afternoon, and a nihilist philosopher/writer by the evening!

(2) Comments

  1. At tummy.com, we do a lot of Linux high-availability work. To answer your question about whether the two servers in an IP fail-over situation need to be on the same switch: No, they don’t.

    IP fail-over typically is done via sending a gratuitous ARP out when the IP change is done, so that things on that segment get the new MAC address for the IP. The MAC on each system stays the same, so a layer-2 switch doesn’t see any change — each machine’s MAC stays on the same interface as before.

    It's the devices on that segment (most importantly, usually, the router) that notice the change. They get the gratuitous ARP message, update their ARP tables, and start sending the traffic for that IP to the new MAC address.

    Once you start thinking about what is happening at the MAC level to pass packets around a local network segment, this all becomes pretty clear.

    Sean

  2. Thanks for posting this. I'd better show this to my system admin so we can also study our website and network.
