Scaling ERPNext with MySQL

(Suggestions Wanted)

Scaling is what I call a "good" problem to have. As we keep adding customers and trial accounts, our servers have started showing signs of overload. To give you an idea, we have close to 100 live ERPs and another 100 trials running on our server. Each account is a separate database instance with over 200 tables. Here is our configuration:

  1. Intel i7 8-core processor, 1.6GHz
  2. 24GB RAM
  3. Seagate 1500GB hard disk.

As you know, our database engine is InnoDB (MySQL), and the current size of our InnoDB file is 16GB.

Though we are reasonably passable when it comes to writing software, none of us is an expert in infrastructure or system administration (Anand comes closest to being somewhat qualified). So here is what we did.


To monitor the system performance we use:

  1. MySQL Slow Query Log: This records queries that take too long to execute.
  2. MySQL processlist: We keep refreshing this to see how many queries are pending at any one time.
  3. htop: The nice interactive utility that shows how the CPU and RAM are performing.
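For reference, here is roughly how the slow query log is set up. The values and paths below are illustrative, not our actual config; note that MySQL 5.0 (which we run) uses the log-slow-queries option, while 5.1+ replaced it with slow_query_log:

```ini
[mysqld]
# Log any query that takes longer than 2 seconds (the default is 10)
log-slow-queries = /var/log/mysql/mysql-slow.log
long_query_time  = 2
```

The processlist itself is just SHOW FULL PROCESSLIST; from a mysql prompt, refreshed by hand.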

The usual suspects for any performance issues are RAM and CPU utilization. So, initially, when we received complaints of slow responses, we checked all of the above and did not see anything unusual; CPU and RAM usage were well below capacity. It was only during upgrades that we saw all 8 CPUs being fully used.

The slow query log showed something unusual though. Apart from the occasional complex join query, most of it was filled with "COMMIT" statements. If you know a bit about databases, this is the point where all the updates become "permanent", i.e. are written to the disk.
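As an aside for readers in the same spot: whether each COMMIT hits the disk immediately is controlled by innodb_flush_log_at_trx_commit. It is a durability-versus-write-load trade-off; the snippet below is a sketch of the options, not a setting we have tested in production:

```ini
[mysqld]
# 1 (default): flush and fsync the InnoDB log at every COMMIT - safest,
#              but every transaction costs a disk sync
# 2: write at COMMIT, fsync about once per second - up to ~1s of
#    transactions can be lost on an OS crash or power failure
# 0: write and fsync about once per second - up to ~1s can be lost
#    even on a mysqld crash
innodb_flush_log_at_trx_commit = 2
```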

My first reaction was: now this isn't helpful. It's not telling me whether the write is slow for an Invoice entry or a Stock entry or something like that. So I went back to the processlist to see which tables were getting locked, which is not too helpful either. It's like trying to read the destination written on a running train.

Sometimes when you start on a line of thinking, you miss the obvious. I assumed that we could improve our performance by identifying which part of the code to optimize, because I believed that we had enough hardware to get this running. I ignored the writing on the wall.


Finally, I had a "tube light" moment.

The hard-disk was the bottleneck!

So I started investigating a bit and reading a few blogs [1]. The first thing you run into is RAID systems and how you can optimize performance with them. Now, since I am a complete newbie, I had heard the word RAID before but had no real idea what it involved; it is also something that needs to be decided at the time of setting up the servers. Since we were well past that, I looked for ways to verify whether my hypothesis was correct.

In Linux you can find almost anything you are looking for, and I was looking for a utility called iostat, which monitors the performance of the disks. After learning a bit about how to read iostat, I realized that the key figure I needed to look at to test my hypothesis was the amount of write activity on the disk. The next thing I wanted was benchmarks, so that I would know what a reasonable average queue size looks like.
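Concretely, running iostat -x 5 prints extended per-device statistics every 5 seconds. A small sketch of how to read one line of it follows; the sample numbers are made up for illustration (not our actual figures), and the column layout varies a little between sysstat versions:

```shell
# One illustrative line of `iostat -x` output for a device, with fields:
# device rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sample='sda 0.50 120.30 1.20 85.40 48.00 1650.00 39.2 4.80 55.4 2.1 18.3'

# The columns to watch for a write bottleneck:
#   w/s      - write requests per second
#   wkB/s    - kilobytes written per second
#   avgqu-sz - average length of the device's request queue
#   await    - average time (ms) a request spends waiting plus being served
echo "$sample" | awk '{ printf "w/s=%s wkB/s=%s avgqu-sz=%s await=%sms\n", $5, $7, $9, $10 }'
```

A sustained high avgqu-sz and await together with heavy wkB/s, while the CPU sits idle, is exactly the disk-bound pattern described here.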

Unfortunately there was no benchmark available, but a lot of people had posted their iostat outputs. When I compared these with ours, it immediately became clear that the amount of data we were writing was very high.

So, along with the "COMMIT"s in the slow query log, I think we can conclude that the hypothesis is correct: the bottleneck is the hard disk. You can process as many parallel requests as you like, but they all need to queue up to write to the disk. Apart from that, we also have a web server running, and god knows how many logs waiting to be written.

This means that while the status of the usual suspects, RAM and CPU, was this: (picture [2])

The status of the hard-disk was this: (picture [3])

Next Steps

Now that our stupidity is obvious, we need to quickly decide what to do. I read up a bit on RAID, but it seems an added level of complexity. What I am thinking is adding more servers, with fewer accounts on each, and not as much RAM and CPU horsepower. This will give us the capacity to add low-cost machines and balance the traffic between memory and disk.

If you are an expert, would love to hear from you.

Post Script #1

Thanks for the awesome response. We are running a master-slave setup (for redundancy). We need binlogs not only for this but also for debugging.
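For context, the binlog-related part of such a setup looks roughly like the sketch below. The values are illustrative rather than our actual config; sync_binlog is the knob that decides whether the binary log adds one more disk sync per transaction:

```ini
[mysqld]
# The binary log is required for master-slave replication (and is handy
# for debugging), but it is one more stream of writes to the disk.
log-bin          = mysql-bin
# 0: let the OS decide when to flush the binlog (fewer syncs, less durable)
# 1: fsync the binlog at every commit (safest, one extra sync per commit)
sync_binlog      = 0
# Purge old binlogs so they do not eat the disk
expire_logs_days = 7
```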

Here is some more info:

my.cnf (stripped)

This is basically the standard config that gets shipped with Percona release with some increased caches:





Output from MySQLTuner:

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.0.86-ius+percona-highperf-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive +BDB +Federated +InnoDB -ISAM -NDBCluster
[--] Data in MyISAM tables: 322M (Tables: 931)
[--] Data in InnoDB tables: 6G (Tables: 70633)
[!!] BDB is enabled but isn't being used
[!!] Total fragmented tables: 91

-------- Performance Metrics -------------------------------------------------
[--] Up for: 43d 22h 25m 9s (347M q [91.574 qps], 11M conn, TX: 541B, RX: 30B)
[--] Reads / Writes: 73% / 27%
[--] Total buffers: 10.2G global + 11.2M per thread (500 max threads)
[OK] Maximum possible memory usage: 15.7G (66% of installed RAM)
[OK] Slow queries: 0% (5K/347M)
[OK] Highest usage of available connections: 16% (83/500)
[OK] Key buffer size / total MyISAM indexes: 64.0M/2.7M
[OK] Key buffer hit rate: 99.8% (610M cached / 1M reads)
[OK] Query cache efficiency: 56.4% (104M cached / 185M selects)
[!!] Query cache prunes per day: 454539
[OK] Sorts requiring temporary tables: 1% (48K temp sorts / 4M sorts)
[!!] Temporary tables created on disk: 40% (12M on disk / 31M total)
[OK] Thread cache hit rate: 99% (9K created / 11M connections)
[!!] Table cache hit rate: 0% (2K open / 14M opened)
[OK] Open file limit used: 0% (121/65K)
[OK] Table locks acquired immediately: 99% (111M immediate / 111M locks)
[OK] InnoDB data size / buffer pool: 6.5G/10.0G

-------- Recommendations -----------------------------------------------------
General recommendations:
   Add skip-bdb to MySQL configuration to disable BDB
   Run OPTIMIZE TABLE to defragment tables for better performance
   When making adjustments, make tmp_table_size/max_heap_table_size equal
   Reduce your SELECT DISTINCT queries without LIMIT clauses
   Increase table_cache gradually to avoid file descriptor limits
Variables to adjust:
   query_cache_size (> 32M)
   tmp_table_size (> 64M)
   max_heap_table_size (> 64M)
   table_cache (> 2048)
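Translated into my.cnf form, those recommendations would look something like the sketch below; the exact values are illustrative and would need to be raised gradually while watching memory use and open-file limits:

```ini
[mysqld]
skip-bdb                      # BDB is enabled but unused
query_cache_size    = 64M     # tuner suggests > 32M
tmp_table_size      = 128M    # tuner suggests > 64M
max_heap_table_size = 128M    # keep equal to tmp_table_size
table_cache         = 4096    # tuner suggests > 2048; raise gradually
```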

Post Script #2

I had mistakenly identified the packet size in iostat as the queue size and had posted a screenshot. I have removed it so as not to mislead. The original hypothesis remains correct: by reducing the number of writes, we were able to speed up the processing.

But there is no alternative to comprehensive monitoring. 

[1] Blogs to read on analyzing disk bottleneck issues:

[2] Picture Courtesy: Michael Loke, Creative Commons via Flickr

[3] Picture Courtesy: Frederick Bisson, Creative Commons via Flickr

Rushabh Mehta

Rushabh is a software developer and founder of ERPNext. He usually writes about the startup experience, open source and the technologies he is working on.


Rushabh Mehta 10 years ago


Thanks for the excellent tips.

We do use Percona. We did a lot more optimization in the a

Manish Singh 10 years ago

This is 1+ year old but thought I would give you my todo list for scaling the database

  1. Switch to
Chris Sherlock 11 years ago

What do you mean by a "metadata driven engine"?

Amit Kumar 11 years ago

You might want to check if your updates are creating temporary tables. That could be an issue too

Scalability Genius 11 years ago

@Ravi - I think he's saying that his database size is 16GB which is orders of magnitude smaller t

Rushabh Mehta 11 years ago

@Chris, mostly inserts and updates. We have removed most of the redundancies, but the General Led

Piyush 11 years ago

I think you have answered your question with that POST vs GET thing. You need to remove empty com

Chris Sherlock 11 years ago

Probably not related, but disable BDB. You don't need it.

That's a lot of badly fragmented

Amit Kumar 11 years ago

@Rushabh So as it seems, your DB is fragmented, this causes write to be broken. Fix that.


Rushabh Mehta 11 years ago

Thanks for the awesome comments and feedback.

@Scalability Genius, will have to ask our p

Chris Sherlock 11 years ago

Me again. Darned iPad. Change "snorts" to "inserts" and for "won't RAID" I of course meant "want

Piyush 11 years ago

Post your my.cnf

Chris Sherlock 11 years ago

Looks like you have a lot of writes going on there. Look at the wkB/sec! Massive. Almost no reads

Ravi 11 years ago

@Scalability Genius: Yea, it will work, but imagine the cost of buying a 1.5TB SSD disk!

Scalability Genius 11 years ago

The first thing you should do is to switch to an SSD.