Walking Through a Frappe Cloud Incident
How we tackle incidents, along with an RCA of the recent f1-sydney outage.
By Mangesh · 20 January 2026 · 6 min read

Me during any incident which revolves around data deletion.

Preface

We have recently been working on bringing auto-scaling to Frappe Cloud and are almost ready to make it public. Four of Frappe Cloud's public servers now use auto-scaling, including the frappe.io production server.

On 12 January 2026, Frappe Cloud faced an almost 5-hour outage on one of our public servers in the Australia region. This blog post is an RCA of the incident and also explains how the Frappe Cloud team tackles such incidents.

Below is a table of events and timestamps from when we identified the issues:

Event                                        | Timestamp
Low Memory Incident Alert on f1-sydney       | 13:05
Unnecessary auto-scale triggered             | 14:40
Sites Down                                   | 16:50
We found all the data and started recovery   | 20:12
Recovery completed and Sites were Up         | 21:20

Investigating the Incident

Two months ago, when auto-scaling was still in its infancy, we had set it up on f1-sydney, primarily because it was the newest cluster with very little traffic and fewer sites on it. It was an ideal target for testing auto-scaling. However, the way auto-scaling is done now is a bit different from how it was done two months ago, and the Sydney server had not been set up with the new processes.

The problem did not really start with a click on the server dashboard but with an out-of-memory issue on the app server. According to the analytics charts, f1-sydney had been facing low-memory issues from approximately one day before the incident. Whenever an incident occurs, the Frappe Cloud team gets SMS alerts and phone calls regardless of the time, but we receive a call only when the site, database, or app server is actually down. Nothing was down the day before, so the incident investigator did not page us. On 12 January 2026, we received our first notification that the database on f1-sydney was down. We tried rebooting the server, but that did not really help, and the server was at its maximum resource utilization. Now, you might wonder: if auto-scaling was enabled, why had it not scaled already? It should have, but on this server auto-scaling needed to be scheduled, and no schedule had triggered it.
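
For illustration, here is a minimal shell sketch of the alerting rule described above. It is not the actual Frappe Cloud monitoring stack; the threshold, the service name, and the commands are assumptions used only to show "alert on low memory, but only call a human when a service is actually down".

#!/usr/bin/env bash
# Hypothetical sketch, not the real monitoring system.

# Percentage of memory still available (MemAvailable / MemTotal).
avail_pct=$(free | awk '/^Mem:/ {printf "%d", $7 * 100 / $2}')

if [ "$avail_pct" -lt 10 ]; then
    echo "low-memory incident on $(hostname): ${avail_pct}% available"   # -> SMS alert only
fi

# A phone call is reserved for hard failures, e.g. the database not responding.
if ! systemctl is-active --quiet mariadb; then
    echo "database down on $(hostname)"                                  # -> phone call
fi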

The same day was also the Version 16 release date. We had very few people available in Frappe Cloud to handle an abrupt incident like this; most of the engineers were busy setting up prerequisites on the Frappe Cloud side for Version 16. It was around 4 PM, Version 16 was about to go public and people were creating releases and tags with "version-16", when we started getting calls about f1-sydney. All 130+ sites on f1-sydney were down.

The effect on the customer sites was a "404 Not Found" on every site URL. One of our engineers quickly started investigating what was wrong and why all the sites were returning 404 at the same time. An hour passed, and the engineer looking into the incident still could not find the root cause. By now we had started getting tickets about it, but the debugging had turned up next to nothing useful.
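
For reference, the symptom could be reproduced from outside with a plain HTTP check; the site URL below is only a placeholder.

# Every site on the server answered with a 404.
curl -sS -o /dev/null -w "%{http_code}\n" https://example-site.frappe.cloud
# 404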

Four of us got on a call and started discussing what actions had been performed before and after the incident. We rarely have time for this discussion, because the first priority is always to get the sites up as soon as possible, but it continues while one engineer shares their screen. During the discussion, we learned that an auto-scale had been triggered on the Sydney server, and that the auto-scaling setup on that server was not up to date. That setup was close to broken and an auto-scale should never have been triggered on it, but a lack of knowledge led one of our engineers to trigger it anyway. The scale-up triggered on the server failed. Then, to disable auto-scaling on that server, a tear-down was performed on the secondary server. Both jobs failed midway.

In short, we achieve zero downtime by routing everything to the secondary server while the scaling completes on the primary server. Once the scaling is complete, we revert the traffic and processes to the primary, and the secondary server stays behind waiting for its next scaling opportunity. Currently, auto-scaling can either be configured to trigger automatically or be scheduled to happen at a particular time of day.
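
As an illustration of the scheduled mode, a scale-up pinned to a fixed time of day boils down to something like the following cron entry; the script path is purely hypothetical and stands in for whatever actually enqueues the scaling job.

# Hypothetical crontab entry: trigger the scheduled scale-up every day at 02:00.
0 2 * * * /opt/frappe/scale-up.sh >> /var/log/auto-scale.log 2>&1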

Tearing down the secondary server aka disabling auto-scaling


By now, the information we had was that all the sites were down because their benches were missing from the server. On Frappe Cloud, every site is installed on a bench, and a bench is essentially a Docker container; all of that bench's sites run inside the container.
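
Since each bench is just a container, a quick way to see which benches are actually running on a server is to list the containers; the names in the sample output below are illustrative.

docker ps --format '{{.Names}}\t{{.Status}}'
# bench-0001-000123-f1-sydney    Up 3 days
# bench-0001-000124-f1-sydney    Up 3 days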

Where did the benches go?

While the sites were dead, the agent was still running on our production server and trying to do its day-to-day job: taking backups, updating sites whenever a customer triggered an update, and so on. But since all the sites were down, the agent kept returning the status "Undelivered", which made complete sense. We kept digging further into the details of the agent jobs.

We checked the data in the volumes attached to the primary and secondary servers. We prepared a list of benches missing from f1-sydney and started searching for the rest of them. On Frappe Cloud, behind the scenes, all the add-on disks bought from the cloud providers are mounted at /opt/volumes/benches/. We checked the volumes on both servers, but we couldn't find the missing benches in either of them.
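
A sketch of that check: confirm where the add-on volume is mounted, then list the benches it holds on each server. The secondary server's hostname below is illustrative.

# Where is the add-on disk mounted, and which benches does it hold?
findmnt --target /opt/volumes/benches
ls /opt/volumes/benches/home/frappe/benches/

# The same check on the secondary server (hostname is illustrative).
ssh f1-sydney-secondary 'ls /opt/volumes/benches/home/frappe/benches/'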

One engineer suspected that, since things had broken down partway through, there was probably a sync in progress between the directories on the primary and secondary servers. This turned out to be correct.

Remember the secondary server tear-down that failed midway? That midway point was when the benches had started moving to a common parking space for all the benches called /home/frappe/shared. This directory is a common volume mounted across the primary, the secondary, and the NFS server. The copying of benches to /home/frappe/shared was interrupted, which is why only a few benches were on the primary and the rest were in /home/frappe/shared. At this point, there was a sigh of relief: we had found all the benches and there was no sign of any data loss.
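
One quick way to confirm where the benches had ended up is to diff the two directory listings; a rough sketch of that check:

# Benches that exist in the shared parking space but not on the primary volume.
comm -13 \
  <(ls /opt/volumes/benches/home/frappe/benches/ | sort) \
  <(ls /home/frappe/shared/ | sort)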

The job steps from the scale-up were clear evidence for that hunch.

Syncing between primary and secondary server


Recovering

The /home/frappe/shared directory was designed so that once the server scales up, it serves as the primary base from which the containers start. You can also see in the image above that syncing happens between the two directories (/home/frappe/shared and /home/frappe/benches). By default, on a server without auto-scaling, all the containers run from the /home/frappe/benches directory.
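
A rough sketch of what that sync amounts to; the real job is run by the Frappe Cloud tooling, so the rsync invocation and its direction here are only illustrative.

# Keep the shared parking space in sync with the benches on the primary.
rsync -a /home/frappe/benches/ /home/frappe/shared/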

Things after this were pretty straightforward: we had to move all the benches to their expected locations and restart them, after which the sites should function normally.

# Move the benches stranded in the shared directory back onto the benches volume.
cd /home/frappe/shared
mv -v ./* /opt/volumes/benches/home/frappe/benches/

Once the copying was done, we restarted all the containers. That took quite some time: there were 38 containers on that server, and restarting them took around 40 minutes.
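
Restarting them was a matter of iterating over every container on the host; a minimal sketch:

# Restart every container on the server (running or stopped), one at a time.
docker ps -aq | xargs -r -n 1 docker restart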

After almost 5 hours, all the sites were up, but with broken CSS and missing icons. The data was completely untouched, but the assets for all the sites were broken. The final step was to manually rebuild the assets for all the sites, after which everything was good to go. Rebuilding all the assets took another 1-2 hours.
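
A manual asset rebuild roughly looks like the following, assuming Frappe's standard bench build is run inside each bench container; the container name and the in-container bench path are assumptions.

# Rebuild the static assets (CSS, JS, icons) for every site on one bench.
docker exec bench-0001-000123-f1-sydney \
    bash -c 'cd /home/frappe/frappe-bench && bench build'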

Human error is sometimes undeniable

This incident was purely due to poor communication and lack of clean-up.

With every incident, our team learns harsh lessons. The frequency of incidents on our servers has gradually decreased over time.
