innobackupex Procedure
The innobackupex script produces a backup that is a datadir replacement for your mysql installation.- Copy tablespaces (ibdata and idb files)
- Copy rest of the .frm, .MRG, .MYD, .MYI, .TRG, .TRN, .ARM, .ARZ, .CSM, .CSV, .par, and .opt files
- Apply the logs it collects while copying the files (using innodb crash recovery)
For more info, read Percona's documentation.
Issue#1 Open files limit
We use innodb-file-per-table and innobackupex opens all tablespaces at once to avoid issues relating to tablespace renames. The backup and MySQL crashed when it tried to open over 100K files at once. There's a discussion about this on XtraBackup's launchpad.Downtime: 5 minutes for MySQL to start again
Fix: After some reading we decided to raise the file limit (set to 64K before). MySQL, after starting raises the open files limit for itself using ulimit to the value of open-files-limit in my.cnf. This requires a restart and I reduced the restart time to ~30s thanks to this DBA StackExchange answer.
Issue#2 Global READ Lock
Before xtrabackup copies .frm, .MRG, .MYD, .MYI, .TRG, .TRN, .ARM, .ARZ, .CSM, .CSV, .par, and .opt files (ie. files other than the innodb datafiles), it acquires a READ lock to disallow any DDL that may change these files. When our backup reached this stage, our service went down as all session INSERT and UPDATE statement were waiting for this lock and all requests timed out. After five minutes of waiting for the copying to finish, I had to kill the backup to bring the service back up.Downtime: 7m
Fix: The innobackupex utility has an option --rsync, which instead of copying files after aquiring the lock, first rsyncs them without locking and rsyncs again with a lock to sync changes, reducing the lock time (we didn't notice any downtime :)). More info here and here
Percona's documentation also has a page on this which talks about the part when the backup script is not able to acquire the READ lock because of currently running queries. It talks about options you can pass that can either allow the quieries to acquire lock, kill queries if they take long, etc. We didn't have to use those.
Gratitude
Thank you Percona for such an awesome (OpenSource) utility. I especially liked the part where the solutions to the problems I faced are documented (and didn't have to get ideas from a dude/dudette's blog post, but from the official docs).TL;DR: We had to increase open files limit and used `--rsync` option to reduce lock time.