MySQL Hot Backup in Real Life

We used innobackupex to take a hot backup of our MySQL instance on Frappecloud

 · 2 min read

We had decided to delay setting up replication to after moving all accounts to frappecloud. That meant setting up replication for over 500 databases but we were banking on a tool called innobackupex from the folks at Percona to take an "online" backup without any downtime (mysqldump would take about 90m). I had read a few blog posts on the web about setting up replication with innobackupex with no hickups but due to Monsoon in India, we didn't have a sunny day.




innobackupex Procedure

The innobackupex script produces a backup that is a datadir replacement for your mysql installation.

- Copy tablespaces (ibdata and idb files)
- Copy rest of the .frm, .MRG, .MYD, .MYI, .TRG, .TRN, .ARM, .ARZ, .CSM, .CSV, .par, and .opt files
- Apply the logs it collects while copying the files (using innodb crash recovery)
 
For more info, read Percona's documentation.

Issue#1 Open files limit

We use innodb-file-per-table and innobackupex opens all tablespaces at once to avoid issues relating to tablespace renames. The backup and MySQL crashed when it tried to open over 100K files at once. There's a discussion about this on XtraBackup's launchpad.

Downtime: 5 minutes for MySQL to start again

Fix: After some reading we decided to raise the file limit (set to 64K before). MySQL, after starting raises the open files limit for itself using ulimit to the value of open-files-limit in my.cnf. This requires a restart and I reduced the restart time to ~30s thanks to this DBA StackExchange answer.

Issue#2 Global READ Lock

Before xtrabackup copies .frm, .MRG, .MYD, .MYI, .TRG, .TRN, .ARM, .ARZ, .CSM, .CSV, .par, and .opt files (ie. files other than the innodb datafiles), it acquires a READ lock to disallow any DDL that may change these files. When our backup reached this stage, our service went down as all session INSERT and UPDATE statement were waiting for this lock and all requests timed out. After five minutes of waiting for the copying to finish, I had to kill the backup to bring the service back up.

Downtime: 7m

Fix: The innobackupex utility has an option --rsync, which instead of copying files after aquiring the lock, first rsyncs them without locking and rsyncs again with a lock to sync changes, reducing the lock time (we didn't notice any downtime :)). More info here and here

Percona's documentation also has a page on this which talks about the part when the backup script is not able to acquire the READ lock because of currently running queries. It talks about options you can pass that can either allow the quieries to acquire lock, kill queries if they take long, etc. We didn't have to use those.

Gratitude

Thank you Percona for such an awesome (OpenSource) utility. I especially liked the part where the solutions to the problems I faced are documented (and didn't have to get ideas from a dude/dudette's blog post, but from the official docs).

TL;DR: We had to increase open files limit and used `--rsync` option to reduce lock time.
 

Pratik Vyas

Pratik takes care of Frappecloud and nags everyone about blasphemous engineering practices. He's also responsible for any cryptic responses and texts related to frappe and erpnext that you may find.

No comments yet

No comments yet. Start a new discussion.

Add Comment