ERPNext High Availability

A look into ERPNext deployment architecture



High Availability refers to achieving the following goals in your setup:

  • Availability of service in case of single or multiple node failures.
  • No data loss in case of single or multiple node failures.
  • Availability of good read throughput in case of high traffic.

The Ideal Architecture


Load balancer

The load balancer distributes incoming HTTP requests among the application servers, so the service isn't disrupted when an application node fails. This makes the service "horizontally" scalable in terms of compute load. If you have only one load balancer, it is itself a single point of failure.

Normally, you'd run more than one load balancer, with the load balancers themselves balanced via DNS (which is distributed in the true sense). If you have the budget, time and expertise, you can also set up multiple load balancers with a heartbeat-like system.
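To make the idea concrete, here is a minimal sketch of what a load balancer does with a pool of application servers: hand out backends in rotation and skip the ones that are down. The class and the backend names are hypothetical and only illustrate the behaviour; a real deployment would rely on a proper load balancer rather than code like this.

```python
class RoundRobinBalancer:
    """Hands out backends in rotation, skipping ones marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._index = 0

    def mark_down(self, backend):
        # Called when a health check fails (the health check itself is not shown).
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._index]
            self._index = (self._index + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy application servers")


balancer = RoundRobinBalancer(["app1", "app2", "app3"])  # hypothetical names
balancer.mark_down("app2")
print([balancer.next_backend() for _ in range(4)])
```

With "app2" marked down, requests keep flowing to the remaining two servers, which is the property the load balancer is there to provide.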

Application Servers

Application servers process incoming HTTP requests. They query databases/cache if required but do not maintain any state. As long as one of these is up, the service isn't down.
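A rough way to see why stateless application servers help: if each server is up with some probability and failures are independent (a simplifying assumption that real outages often violate), the chance that at least one server is up grows quickly with the number of servers. The function name and the 99% figure below are illustrative only.

```python
def service_availability(node_availability, nodes):
    """Probability that at least one of `nodes` independent servers is up."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("need at least one node")
    # The service is down only when every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** nodes


for count in (1, 2, 3):
    print(count, "server(s):", round(service_availability(0.99, count), 6))
```

This only covers the application tier; the load balancer and the database still need their own answer to failure.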

Background Workers

Background workers execute scheduled jobs.

Memcached and Redis Services

ERPNext also depends on Memcached for caching and Redis as a broker for background task workers.

Memcached is distributed by design: adding all available Memcached nodes to the application server configuration is all the configuration that is required. Failure of Memcached doesn't cause the service to go down. Failure of one or more Memcached nodes is handled automatically by the client (the application server).
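The client-side behaviour described above can be sketched roughly as follows. This is not the code of any real Memcached client: plain dictionaries stand in for cache nodes, a node missing from the dictionary is treated as down, and `pick_node`, `cached_get` and `load_from_db` are hypothetical names. Real clients may use a different (for example consistent) hashing scheme.

```python
import hashlib


def pick_node(key, live_nodes):
    """Map a key to one of the live nodes, or None if none are left."""
    if not live_nodes:
        return None
    ordered = sorted(live_nodes)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return ordered[int.from_bytes(digest[:4], "big") % len(ordered)]


def cached_get(key, caches, load_from_db):
    """Read through the cache; fall back to the database on a miss or outage.

    `caches` maps a node name to a dict standing in for that cache node.
    """
    node = pick_node(key, list(caches))
    if node is not None and key in caches[node]:
        return caches[node][key]
    value = load_from_db(key)  # slower, but the request still succeeds
    if node is not None:
        caches[node][key] = value
    return value


caches = {"mc1": {}, "mc2": {}}  # hypothetical node names
print(cached_get("user:1", caches, lambda k: k.upper()))
print(cached_get("user:1", {}, lambda k: k.upper()))  # all cache nodes down
```

The second call shows the point made above: with every cache node gone, the request is slower but still answered.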

Failure of the Redis server would cause the scheduled and background tasks of the service to go down. It's possible to set up multiple Redis servers in master-slave fashion, and software exists to perform automatic failover (http://redis.io/topics/sentinel).
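The sketch below illustrates the general idea behind quorum-based failover: several monitors watch the master, and a failover starts only when enough of them agree it is unreachable. It is loosely inspired by that idea and is not Sentinel's actual algorithm; the function names, the monitor names and the "offset" values are all hypothetical.

```python
def is_objectively_down(reports, quorum):
    """`reports` maps a monitor name to True if it can still reach the master."""
    unreachable = sum(1 for reachable in reports.values() if not reachable)
    return unreachable >= quorum


def choose_new_master(replicas):
    """Pick the replica that has received the most data.

    `replicas` maps a replica name to a (hypothetical) replication offset.
    Ties are broken by name so the choice is deterministic.
    """
    if not replicas:
        return None
    return max(sorted(replicas), key=lambda name: replicas[name])


reports = {"monitor1": False, "monitor2": False, "monitor3": True}
if is_objectively_down(reports, quorum=2):
    print("promote:", choose_new_master({"replica1": 100, "replica2": 250}))
```

Requiring agreement from more than one monitor avoids a failover triggered by a single monitor losing its own network connection.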

Database (MariaDB) cluster

The database cluster consists of multiple nodes and exposes itself as a transparent database service to the application servers.

This lets the setup tolerate multi-node failure and ensures "availability" and "no data loss" in case of a node failure. However, it increases the complexity of your setup.

Setting up a cluster with automatic failover requires complex monitoring and cluster management software (such as Galera from MariaDB or Percona Cluster Manager). These typically run as a multi-master setup with a few read slaves. We do not have experience in running this stack in production.

Also, having automatic failover configured is risky. If something goes wrong while your sysadmin is away, data inconsistency/loss or availability issues might occur. This happens to the best in the industry too; see https://github.com/blog/1261-github-availability-this-week
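One way automatic failover goes wrong is a network partition, where two groups of database nodes each conclude that the other group has failed and both keep accepting writes. A common guard is a majority rule: a group may accept writes only if it can see more than half of the cluster. The function below is a minimal sketch of that rule, not taken from any particular cluster manager, and its name is hypothetical.

```python
def may_accept_writes(reachable_nodes, cluster_size):
    """True if this partition holds a strict majority of the cluster.

    `reachable_nodes` counts the local node plus every peer it can reach.
    """
    if cluster_size < 1 or not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return reachable_nodes * 2 > cluster_size


# A three-node cluster split 2/1: only the larger side keeps writing.
print(may_accept_writes(2, 3), may_accept_writes(1, 3))
# A two-node cluster split 1/1: neither side has a majority.
print(may_accept_writes(1, 2))
```

The two-node case shows the trade-off: the rule prevents conflicting writes, but the whole database becomes unavailable for writes until someone intervenes.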

What we do

What we offer is the (simple) setup below. Although a single-server setup is a single point of failure, there is less chance of complications during failover.


Crash plan

  • Back up & rsync (if possible)
  • Change slave to master
  • Start services (redis, supervisor, nginx) on slave
  • Switch DNS
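The crash plan above is a manual runbook, but its shape can be sketched in code: run the steps in order, treat the backup as optional, and stop at the first required step that fails so a human can take over. Everything here is hypothetical; the lambdas are placeholders for whatever commands your setup actually uses, and `run_crash_plan` is not part of ERPNext or Frappe.

```python
def run_crash_plan(steps, dry_run=True):
    """Run (description, action, required) steps in order.

    Each action returns True on success. Returns the descriptions of the
    steps that completed (or, in dry-run mode, that would be attempted).
    """
    done = []
    for description, action, required in steps:
        if dry_run:
            print(f"[dry-run] {description}")
            done.append(description)
            continue
        if action():
            done.append(description)
        elif required:
            print(f"FAILED: {description}; stopping for manual intervention")
            break
        else:
            print(f"skipped (optional): {description}")
    return done


CRASH_PLAN = [
    # The actions below are placeholders, not real commands.
    ("back up & rsync from the old master", lambda: True, False),
    ("change slave to master", lambda: True, True),
    ("start redis, supervisor, nginx on slave", lambda: True, True),
    ("switch DNS", lambda: True, True),
]

run_crash_plan(CRASH_PLAN)  # dry run: only prints the steps
```

Keeping the default at `dry_run=True` is deliberate: a failover script that does something on an accidental run is its own availability risk.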

Pratik Vyas

Pratik takes care of Frappecloud and nags everyone about blasphemous engineering practices. He's also responsible for any cryptic responses and texts related to frappe and erpnext that you may find.
