High Availability refers to achieving the following goals in your setup:
- Availability of service in case of single or multiple node failures.
- No data loss in case of single or multiple node failures.
- Good read throughput under high traffic.
The Ideal Architecture
Load balancer
The load balancer distributes incoming HTTP requests among the application servers, so the service isn't disrupted when a node fails. This makes the service "horizontally" scalable in terms of compute load. If you have only one load balancer, it is itself a single point of failure.
Normally, you'd have more than one load balancer, with the load balancers themselves balanced via DNS (which is distributed in the true sense). If you have the budget, time and expertise, you can also set up multiple load balancers with a heartbeat-like system.
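As a rough illustration of the idea, the sketch below rotates requests across application servers and skips any server marked as down. It is a toy model, not how any particular load balancer is implemented; the class name, method names and backend addresses are all hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.down = set()
        self._rotation = cycle(self.backends)

    def mark_down(self, backend):
        # In a real setup a health check would call this.
        self.down.add(backend)

    def mark_up(self, backend):
        self.down.discard(backend)

    def next_backend(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._rotation)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no healthy application servers available")


# Hypothetical application server addresses.
balancer = RoundRobinBalancer(["app1:8000", "app2:8000", "app3:8000"])
balancer.mark_down("app2:8000")
picked = [balancer.next_backend() for _ in range(4)]
print(picked)
```

With one of three servers down, requests keep flowing to the remaining two. Only when every server is down does the balancer have nothing to return.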
Application Servers
Application servers process incoming HTTP requests. They query databases/cache if required but do not maintain any state. As long as one of these is up, the service isn't down.
Background Workers
Background workers execute scheduled jobs.
Memcached and Redis Services
ERPNext also depends on Memcached for caching and Redis as a broker for background task workers.
Memcached is distributed by design; adding all available memcached nodes to the application server configuration is all the configuration required. Failure of memcached doesn't cause the service to go down. Failure of one or more memcached nodes is handled automatically by the client (the application server).
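To make the client-side handling concrete, here is a minimal sketch of how a client could map keys to cache nodes and route around a dead node. It uses rendezvous (highest-random-weight) hashing, which is an assumption chosen for brevity; real memcached clients may use a different scheme. The function name and node addresses are hypothetical.

```python
import hashlib


def pick_node(key, nodes, dead=frozenset()):
    """Return the live node with the highest hash score for this key, or None."""
    live = [n for n in nodes if n not in dead]
    if not live:
        # No cache available: the caller would fall back to the database.
        return None

    def score(node):
        digest = hashlib.sha256(f"{node}|{key}".encode()).hexdigest()
        return int(digest, 16)

    return max(live, key=score)


# Hypothetical memcached node addresses.
nodes = ["cache1:11211", "cache2:11211", "cache3:11211"]
keys = [f"doc:{i}" for i in range(100)]

before = {k: pick_node(k, nodes) for k in keys}
after = {k: pick_node(k, nodes, dead={"cache2:11211"}) for k in keys}

# Only keys that lived on the failed node are reassigned.
moved = [k for k in keys if before[k] != after[k]]
```

A lost node costs only the cached entries it held. Those keys miss once, are recomputed from the database, and land on a surviving node.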
Failure of a Redis server would cause the scheduled and background tasks of the service to go down. It's possible to set up multiple Redis servers in master-slave fashion, and software exists to perform automatic failover (http://redis.io/topics/sentinel).
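The general shape of an automatic failover decision can be sketched as follows. This is a simplified model and not Sentinel's actual algorithm: it assumes several monitors vote on whether the master is unreachable, and that the most up-to-date slave is promoted once a quorum agrees. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    offset: int  # how much of the master's data stream has been applied


def should_fail_over(reports, quorum):
    """True when at least `quorum` monitors report the master unreachable."""
    return sum(1 for unreachable in reports.values() if unreachable) >= quorum


def choose_new_master(replicas):
    """Prefer the replica that has replicated the most data."""
    return max(replicas, key=lambda r: r.offset)


# Two of three hypothetical monitors cannot reach the master.
reports = {"monitor-a": True, "monitor-b": True, "monitor-c": False}
replicas = [Replica("redis-2", offset=1040), Replica("redis-3", offset=1100)]

new_master = None
if should_fail_over(reports, quorum=2):
    new_master = choose_new_master(replicas)
```

Requiring a quorum rather than a single report reduces the chance that one monitor's network glitch triggers an unnecessary promotion.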
Database (MariaDB) cluster
The database cluster consists of multiple nodes and exposes itself as a transparent database service to the application servers.
This lets the setup tolerate multi-node failure and ensures "availability" and "no data loss" in case of a node failure. However, it increases the complexity of your setup.
Setting up a cluster with automatic failover requires complex monitoring and cluster management software (such as Galera from MariaDB or Percona Cluster Manager). These typically run in a multi-master setup with a few read slaves. We do not have experience in running this stack in production.
Also, having automatic failover configured is risky. If something goes wrong while your sysadmin is away, data inconsistency/loss or availability issues might occur. This happens to the best in the industry too (see https://github.com/blog/1261-github-availability-this-week).
What we do
What we offer is the (simple) setup below. Although a single-server setup is a single point of failure, there is less chance of complications during failover.
Crash plan
- Backup & rsync (if possible)
- Change slave to master
- Start services (redis, supervisor, nginx) on slave
- Switch DNS
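The crash plan could be written down as an ordered, dry-run checklist like the sketch below, meant to be reviewed and then run by hand on the slave. Everything specific here is an assumption for illustration: the host names, the site directory, the systemd unit names, and the use of `STOP SLAVE; RESET SLAVE ALL;` to promote the slave database. Adapt each command to your own setup.

```python
import shlex


def crash_plan(primary, standby, site_dir):
    """Return the recovery commands in order; nothing is executed here."""
    return [
        # 1. Copy the latest files across, if the old master is still reachable.
        ["rsync", "-az", f"{primary}:{site_dir}/", f"{site_dir}/"],
        # 2. Promote the slave database: stop replicating and drop replica config.
        ["mysql", "-e", "STOP SLAVE; RESET SLAVE ALL;"],
        # 3. Start the services on the slave (unit names are assumptions).
        ["systemctl", "start", "redis-server", "supervisor", "nginx"],
        # 4. DNS is changed at the DNS provider, so this is only a reminder.
        ["echo", f"Point the site's DNS record at {standby}"],
    ]


# Hypothetical host names and path.
steps = crash_plan("erp-primary", "erp-standby", "/path/to/sites")
for step in steps:
    print(shlex.join(step))
```

Printing the commands instead of executing them keeps a person in the loop, in line with the caution about automatic failover above.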