Tag1 Consulting

Performance and Scalability Experts

Achieving high availability on EC2

This past week I've been fortunate to have some spare time to play around with Amazon's Elastic Compute Cloud (EC2). I'm pretty interested in the potential for scaling the LAMP stack by having a programmable cluster at the service of your box. Most of the documentation I find seems to be by people who either scale via dynamic DNS, adding records as they add nodes, or use EC2 nodes as application servers internal to their application. Dynamic DNS has never been a solution I was content with, as I prefer to control programmatically which nodes receive requests without waiting for DNS TTLs to expire. Sure, most web browsers do attempt to fail over when you are using round-robin DNS and a node is down, but that does not solve the problem of a malfunctioning node that is up but returning bad data, nor does it allow immediate changes to weights in order to rebalance load.

The direction I'd like to go with EC2 and LAMP is to use a combination of an elastic IP, availability zones, and IPVS to build a highly available, scalable service. The elastic IP would replace heartbeat in providing the entry point: an externally accessible load balancer. Having Amazon act as the arbiter for assigning the elastic IP should also prevent the split-brain issue that can occur with some high-availability solutions when nodes lose communication with each other. Even if multiple nodes contend for the IP, only one can hold it at a time. Each node can act as a load balancer, distributing requests to its own web server or to the nodes in the other availability zones.
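To make the takeover idea concrete, here is a minimal sketch of how a node might claim the elastic IP and then check whether it actually won. It assumes a boto3-style EC2 client is passed in (for example one created with boto3.client("ec2")) and that the address is identified by an allocation ID; the function name claim_elastic_ip and both of those choices are my own illustration, not something the plan above depends on.

```python
def claim_elastic_ip(ec2_client, allocation_id: str, instance_id: str) -> bool:
    """Try to move the elastic IP to this instance.

    Returns True only if, after the attempt, Amazon reports this
    instance as the holder of the address.
    """
    # Ask Amazon to point the address at us, even if another node holds it.
    ec2_client.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        AllowReassociation=True,
    )
    # Amazon is the single source of truth: re-read who holds the address
    # rather than trusting our own request, since another node may have
    # claimed it right after we did.
    response = ec2_client.describe_addresses(AllocationIds=[allocation_id])
    holder = response["Addresses"][0].get("InstanceId")
    return holder == instance_id
```

Because the final answer always comes from Amazon and not from the nodes themselves, two nodes that cannot see each other still cannot both believe they hold the address for long.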

The issue will be managing the configuration of the IPVS table. Because the eventual plan is to have a dynamically sized cluster, how will the set of available nodes be determined? My hope is to use mon, an incredibly customizable piece of monitoring software, both as the controller for growing and shrinking the network and as the maintainer of the IPVS tables. Separating the management of the cluster from the web app should provide a generic solution that will work for any DB-backed web app.
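As a rough illustration of what maintaining the IPVS table could look like, the sketch below compares the real servers currently in the table with the set a monitor considers healthy and works out the ipvsadm commands needed to reconcile the two. The function name plan_ipvs_changes, the dictionary layout, and the use of masquerading (-m) as the forwarding method are all assumptions of mine; it only builds the command lists and leaves running them to whatever script mon would trigger.

```python
from typing import Dict, List


def plan_ipvs_changes(
    service: str,
    current: Dict[str, int],
    desired: Dict[str, int],
) -> List[List[str]]:
    """Return ipvsadm argument lists that turn `current` into `desired`.

    `service` is the virtual service as "ip:port". Both dicts map a real
    server ("ip:port") to its weight.
    """
    commands: List[List[str]] = []

    # Remove real servers that are no longer considered healthy.
    for server in sorted(current.keys() - desired.keys()):
        commands.append(["ipvsadm", "-d", "-t", service, "-r", server])

    for server in sorted(desired):
        weight = str(desired[server])
        if server not in current:
            # New node: add it to the virtual service.
            commands.append(
                ["ipvsadm", "-a", "-t", service, "-r", server, "-m", "-w", weight]
            )
        elif current[server] != desired[server]:
            # Existing node whose weight changed: edit it in place.
            commands.append(
                ["ipvsadm", "-e", "-t", service, "-r", server, "-m", "-w", weight]
            )

    return commands
```

Each returned list could be handed to subprocess.run on the node that holds the elastic IP. Keeping the planning step free of side effects also makes it easy to log or dry-run the changes before touching the live table.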