Why we moved from Celery to RQ

ERPNext has features like email inbox syncing, event reminders, recurring orders and automatic item ordering. Some of these tasks run every 5 minutes, some every hour, and others at midnight. In ERPNext version 3, we moved from Cron to Celery for handling such periodic tasks. Celery is the industry standard for background jobs in Python, is feature-rich and is built for performance. It covered all the use cases we needed and enabled additional features like non-periodic background tasks. However, over time, we realized that we had never formed a clear mental model of how Celery worked and, consequently, had a difficult time debugging it. In contrast, RQ was so simple that we could get it up and running in a short time with all the features that we had used in Celery.

Documentation

Reading the Celery documentation feels like going through a technical paper. It has an amazing set of features, but we should have realized that we would not be using most of them. Celery is not beginner-friendly. Help for simpler use cases is relegated to the bottom sections, while complex ones are highlighted and often blogged about. That’s how we missed the apply_async(queue='queue_name') feature and instead set up its manual routing system, which was unintuitive and felt like an ink blot on a perfectly good shirt. In contrast, RQ’s documentation is short. It takes about 15 minutes to read in full, and that was enough for us to replace Celery in our existing setup.

~~Opinionated~~

Celery gives you a lot of choice. You get to choose the broker, the result store, the serialization format, the concurrency options and the scheduling. It is powerful, optimized for performance and can also work with languages other than Python. However, its defaults are tuned for large-scale deployments, and one such default came back to bite us. Celery workers are pre-assigned tasks whether they are free or not. This left some busy workers with many pending jobs, while a few free workers had nothing to do until the next beat. Thanks to a blog post, we stumbled upon the solution to this problem, although we later found that it is mentioned in the official documentation under optimizations. In comparison, RQ works only in Python, uses Pickle for serialization, uses Redis as both broker and backend, uses os.fork for child processes and does not provide scheduling. Apart from using another simple library for scheduling, we could implement all our use cases with a clear understanding of how RQ works.
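To make the pre-assignment problem concrete, here is a toy simulation in plain Python. It is only a sketch under simplifying assumptions: the function name makespan, the prefetch parameter and the job durations are made up for illustration, and the model (a free worker reserves a whole batch of jobs at once) is cruder than what Celery really does. It does not use Celery or RQ at all.

```python
import heapq
from collections import deque


def makespan(durations, n_workers, prefetch):
    """Return the time at which the last job finishes.

    Toy model: whenever a worker becomes free it reserves up to
    `prefetch` jobs from the shared queue and runs them one by one.
    """
    queue = deque(durations)
    # Min-heap of (time the worker becomes free, worker id).
    free_at = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(free_at)
    finish = 0
    while queue:
        now, worker = heapq.heappop(free_at)
        batch = [queue.popleft() for _ in range(min(prefetch, len(queue)))]
        done = now + sum(batch)
        finish = max(finish, done)
        heapq.heappush(free_at, (done, worker))
    return finish


# Two slow jobs followed by six quick ones (hypothetical durations).
durations = [10, 10, 1, 1, 1, 1, 1, 1]

print(makespan(durations, n_workers=2, prefetch=4))  # 22
print(makespan(durations, n_workers=2, prefetch=1))  # 13
```

With a batch of four, the first worker reserves both slow jobs while the second finishes its four quick jobs early and then sits idle. When each worker takes one job at a time, the slow jobs are spread across both workers and everything finishes sooner.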

Fairness

We have thousands of ERPNext sites on our servers. We needed to ensure that each site gets its fair share of background workers. Our first thought was to start a sufficient number of workers and enqueue a job for a site only if it does not already exist in the queue. However, there is no easy way to programmatically check for queued jobs in Celery. So, we created separate queues for each ERPNext site instead, and Celery workers would pick jobs from these queues in a round-robin manner. In hindsight, we realize that this was over-engineered and could have been easily solved by implementing inspection and custom job naming. On the other hand, RQ lists all jobs in a queue using Queue.jobs. It couldn’t be simpler than that.
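The "enqueue only if it is not already queued" idea can be sketched in a few lines once the queue can be inspected. The snippet below is a minimal illustration using an in-memory stand-in: ToyQueue, enqueue_unique and the site and task names are hypothetical, and this is not RQ's API or our production code. Its jobs attribute only plays the role that Queue.jobs plays in RQ.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a job queue (hypothetical, not RQ's API)."""

    def __init__(self):
        # Pending (site, task_name) pairs, oldest first.
        self.jobs = deque()

    def enqueue_unique(self, site, task_name):
        """Queue the job unless the same site already has it pending."""
        job = (site, task_name)
        if job in self.jobs:
            return False
        self.jobs.append(job)
        return True

    def dequeue(self):
        """Hand the oldest pending job to a worker, or None if empty."""
        return self.jobs.popleft() if self.jobs else None


queue = ToyQueue()
sites = ["site-a", "site-b", "site-c"]

# The scheduler fires twice before any worker has picked up a job.
for _tick in range(2):
    for site in sites:
        queue.enqueue_unique(site, "sync_email")

order = list(queue.jobs)
print(len(order))  # 3
```

Even though the scheduler ran twice, each site has exactly one pending job, so a slow or noisy site cannot pile up work ahead of the others.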

In defense of Celery, the additional complexity was partially our fault. However, like Python, RQ has only one way to do a thing, and that makes it very difficult to over-complicate and over-engineer. RQ is intuitive, has simple documentation and it just works. In the end, software engineering is about trade-offs, and using RQ instead of Celery seems to be a beneficial one.