Load Overflow

How the stock balance report brought down one of our servers and how we ensured that it doesn't happen again

February 27, 2016

If HTTP requests were cattle, web servers would be the destinations they stampede towards. A web server gets thousands of HTTP requests per second and has very little time to send back each response. If the server gets more requests than it can respond to, a bottleneck forms and the excess requests don’t receive any response, leading to a "Request Timed Out". I bet some of you have seen this before, when your Stock Balance report kept showing "Loading...", which by the way is a great achievement, you brilliantly successful people with millions of stock ledger entries! You are the reason we have to adapt the ERPNext code, ensuring its survival and evolution.

So, it was the stock balance report that brought down one of our servers last week. We have about 250 ERPNext sites on this server. A user *(codename: X)* of one of these sites was having a usual day, but unlike in the past, they opened a few tabs of the stock balance report instead of a single one, intending to set a different Item filter in each and hit refresh. However, loading multiple reports simultaneously overloaded the web server.

You see, this web server can respond to 8 requests at a time *(8 processes)*. Usually, it takes just a few milliseconds for the server to respond. But the stock balance report loads a lot of data from the database and sums it up to find the available inventory in each warehouse for the selected period. If you have a million rows of data, it can take more than 2 minutes to load and process, which exceeds the time limit we have set for the web server to respond to such requests. If the user opens 6 tabs of the stock balance report, 6 of these processes are occupied and only 2 remain available to respond to everyone else's requests. This led to a massive slowdown for the users on this server.

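
To put rough numbers on that slowdown, here is a back-of-the-envelope sketch. The 8 processes come from the story above; the 5 ms cost of a "normal" request is purely an assumption standing in for "a few milliseconds", and both function names are made up for illustration.

```python
TOTAL_WORKERS = 8      # from the post: the web server runs 8 processes
FAST_REQUEST_MS = 5    # assumption: "a few milliseconds" taken as 5 ms


def free_workers(total_workers: int, slow_requests: int) -> int:
    """Workers left over while each slow report request holds one worker."""
    return max(total_workers - slow_requests, 0)


def capacity_per_second(workers: int, request_ms: int) -> int:
    """Fast requests per second that the given number of workers can serve."""
    return workers * (1000 // request_ms)


for tabs in (0, 6, 8):
    free = free_workers(TOTAL_WORKERS, tabs)
    rate = capacity_per_second(free, FAST_REQUEST_MS)
    print(f"{tabs} slow tabs -> {free} free workers, ~{rate} fast requests/s")
```

Under these assumed numbers, six slow tabs cut the server's capacity for everyone else to a quarter, and eight leave nothing at all.
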
Although the main problem was the stock balance report, we needed to make sure that user X’s intent to get the latest inventory values could not bring down the server. Here’s where Nginx comes to the rescue. Nginx is a "reverse-proxy", which is akin to a traffic signal for the web server. It directs the traffic to the web server, but when the web server is overloaded, it kindly responds with "503 Service Temporarily Unavailable". Using the following Nginx configuration, we limited the number of simultaneous active requests that each ERPNext site is allowed at any moment:

    limit_conn_zone $host zone=limit_per_site:10m;

    server {
        ...
        location @web_server {
            limit_conn limit_per_site 4;
            ...
        }
        ...
    }

Very simple, isn’t it? You can replace `$host` with `$binary_remote_addr` to limit connections per IP address instead.

After activating this configuration, even if user X opens 8 tabs of the Stock Balance report, Nginx will allow only 4 of them to load. The other 4 tabs will see the "503 Service Temporarily Unavailable" error page. This per-site restriction keeps the other 4 web server processes available to requests from users of the other 249 sites, thus preventing a single site from monopolizing the web server.

*(P.S.: We eventually did optimize the stock balance report, bringing down its load time from 4 minutes to a minute, for a million records.)*

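
For the curious, the per-site limit described above can be modelled in a few lines of Python. This is only a toy sketch of the idea (count active requests per key, refuse anything over the limit), not how Nginx implements it; the class name, method names and example hostnames are all hypothetical.

```python
from collections import defaultdict


class PerKeyConnectionLimit:
    """Toy model: at most `limit` simultaneous requests per key (e.g. per site)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.active = defaultdict(int)  # key -> requests currently in flight

    def acquire(self, key: str) -> bool:
        """Admit the request if the key is under its limit.

        False corresponds to Nginx answering with a 503.
        """
        if self.active[key] >= self.limit:
            return False
        self.active[key] += 1
        return True

    def release(self, key: str) -> None:
        """Call when a request finishes, freeing a slot for that key."""
        if self.active[key] > 0:
            self.active[key] -= 1


limiter = PerKeyConnectionLimit(limit=4)

# User X opens 8 report tabs on one site; none of them has finished yet.
tabs = [limiter.acquire("site-x.example.com") for _ in range(8)]

# Meanwhile, a user on another site sends a request.
neighbour = limiter.acquire("site-y.example.com")

print(tabs, neighbour)
```

The first four tabs are admitted and the rest are refused, while the neighbouring site is unaffected because it has its own counter. Swapping the hostname for the client's IP address as the key gives the per-IP variant.
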
Anand Doshi
"Anand is the Chief Technology Officer at ERPNext. He reads fiction, dabbles in photography and is always on the watch for the best ToDo app."