Carl Sagan famously said, "If you wish to make an apple pie from scratch, you must first invent the universe". Like an apple pie, many things that seem simple at first are extremely complex; it's just that we choose to ignore the complexity for practical purposes. In this post, I'll try to go as deep into the abstractions beneath the Frappe Scheduler as I can, but I don't promise a recipe for inventing the universe.
Frappe's scheduler is one of the features that truly make it "batteries-included". It lets you specify recurring jobs with a frequency or even cron expressions. It lets you update this configuration in real time without restarting the scheduler process. It even lets you schedule custom code written from the UI, a.k.a. Server Scripts.
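For a sense of what that configuration looks like, recurring jobs are typically declared in an app's hooks.py under scheduler_events. Here's a rough sketch; the dotted module paths are made up for illustration:

```python
# hooks.py (illustrative; the "my_app.tasks.*" paths are hypothetical)
scheduler_events = {
    "daily": [
        "my_app.tasks.send_birthday_reminders",
    ],
    "cron": {
        # standard cron expression: every 15 minutes
        "*/15 * * * *": [
            "my_app.tasks.sync_inventory",
        ],
    },
}
```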
So how does all this work behind the scenes? It's surprisingly simple on the surface and increasingly complex the deeper you go. Here's the rough Python code for scheduler.py with the important bits.
def start_scheduler() -> NoReturn:
    scheduler_tick = get_scheduler_tick()

    while True:
        time.sleep(scheduler_tick)
        for site in get_sites():
            for job in get_all_jobs(site):
                if job.is_due():
                    job.enqueue()
The scheduler is just an infinite loop that keeps checking if any job on any site is due, and then enqueues it. As I said, it is surprisingly simple on the surface, except for that scheduler tick business, which seems interesting.
Scheduler Tick
time.sleep(scheduler_tick)
What is that scheduler tick? Well, that tick-long sleep is what prevents this infinite loop from hammering the CPU constantly. Most Frappe jobs do not have granularity lower than a minute, so checking for pending jobs constantly is busy looping with little to no benefit. By default, Frappe uses a scheduler tick of 60 seconds, i.e. it sleeps for 60 seconds after every loop iteration.
Scheduling Delays
If we sleep for 60 seconds between loop iterations, it implies that your job can be delayed by 30 seconds on average. This small error is usually acceptable compared to the CPU we'd waste by busy looping.
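A quick sanity check of that 30-second figure: if a job becomes due at a uniformly random moment inside a 60-second tick, the scheduler only notices it at the end of the tick, so the delay is whatever is left of the tick, which averages out to half of it. A tiny simulation (not Frappe code, just arithmetic):

```python
import random

tick = 60  # seconds
# Delay = time remaining in the tick when the job becomes due.
delays = [tick - random.uniform(0, tick) for _ in range(100_000)]
print(sum(delays) / len(delays))  # prints roughly 30.0
```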
Except, it's not always 30 seconds. On a multi-tenant system we don't want to enqueue all long-running jobs at 12:00 AM. That would add significant sustained load on the server and might degrade overall performance, so we also add some scheduling jitter to delay long-running jobs randomly by up to 10 minutes. Is this ideal? No. But hardly anyone cares if their birthday reminders are sent at 12:00 AM or 12:10 AM.
Here's the code for adding jitter. It is, again, surprisingly simple.
from datetime import timedelta
from random import randint

next_execution = get_next_execution()  # based on cron expression
next_execution += timedelta(seconds=randint(1, 600))  # up to 10 minutes of jitter
None of this has been hard to understand so far, so where is the complexity I promised?
time.sleep()
How do you think time.sleep() works? A conceptual model of time.sleep in Python might look something like this:
def sleep(seconds):
    now = datetime.now()
    wait_until = now + timedelta(seconds=seconds)
    while wait_until > datetime.now():
        continue
    return
If you implement this, it will totally work as expected, except we've just introduced busy looping again. Well, at least we are not calling the database in a busy loop, so it's an improvement, right? Surely it can't be this dumb? Let's check the CPU usage of the scheduler when it's idle.
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
110408 ankush 39 19 114M 75184 22272 S 0.0 0.5 0:22.84 frappe-scheduler: idle
It's 0%, and the total CPU time is just 22 seconds even though the process has been running for an hour, so it's definitely NOT busy-looping. That means Python's implementation of sleep is more efficient than the mental model we constructed.
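You can observe the same thing without htop. Here's a small standalone script (not part of Frappe) that compares the CPU time consumed by time.sleep against a hand-rolled busy wait:

```python
import time

def busy_sleep(seconds):
    # Our naive mental model: spin until the deadline passes.
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass

for label, fn in [("time.sleep", time.sleep), ("busy_sleep", busy_sleep)]:
    cpu_before = time.process_time()  # CPU time, not wall-clock time
    fn(2)
    print(f"{label}: ~{time.process_time() - cpu_before:.2f}s of CPU for 2s of waiting")
```

The busy wait should burn roughly 2 seconds of CPU, while time.sleep should burn close to none.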
Finding what time.sleep actually does
The best tool for understanding abstractions is "Go To Definition" in your code editor, so I tried it on time.sleep. It gives you def sleep(seconds: float, /) -> None: ... which is just a type stub instead of an implementation, implying it's actually implemented in C.
Now, we could go read the cpython source and figure out what time.sleep is doing, but sleeping efficiently seems like the operating system's job. An easier method is to just trace system calls and take it from there. Linux has an excellent tool called strace to trace all the system calls a process makes. So let's write a trivial sleep program and trace its system calls.
# sleep.py
import time
time.sleep(5)
$ strace python sleep.py
This gives a lot of output, but it's easy to spot the interesting system call because we see a 5-second pause around it.
read(3, "import time\n\ntime.sleep(5)\n", 4096) = 27
lseek(3, 0, SEEK_SET) = 0
read(3, "import time\n\ntime.sleep(5)\n", 4096) = 27
read(3, "", 4096) = 0
close(3) = 0
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, {tv_sec=30266, tv_nsec=649307312}, NULL) = 0
# ****** No output for 5 seconds ******
rt_sigaction(SIGINT, {...}, {...}, 8) = 0
Aha, clock_nanosleep seems like the most likely candidate, and sure enough, if you check the man pages it is exactly what we were looking for.
CLOCK_NANOSLEEP(2) Linux Programmer's Manual CLOCK_NANOSLEEP(2)
NAME
clock_nanosleep - high-resolution sleep with specifiable clock
DESCRIPTION
clock_nanosleep() allows the calling thread to sleep for
an interval specified with nanosecond precision.
...
clock_nanosleep() suspends the execution of the calling thread until either
at least the time specified by request has elapsed, or a signal is delivered
that causes a signal handler to be called or that terminates the process.
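If you want to poke at clock_nanosleep from Python without going through time.sleep, you can call it via ctypes. This is a rough sketch that assumes glibc on 64-bit Linux (constants and struct layout may differ elsewhere):

```python
import ctypes

# struct timespec: on 64-bit Linux both fields are C longs.
class timespec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]

libc = ctypes.CDLL("libc.so.6", use_errno=True)

CLOCK_MONOTONIC = 1  # clock id, as defined in the kernel headers
request = timespec(tv_sec=5, tv_nsec=0)

# flags=0 requests a plain *relative* 5-second sleep; note that CPython's
# own call used TIMER_ABSTIME instead (we'll get to what that means below).
libc.clock_nanosleep(CLOCK_MONOTONIC, 0, ctypes.byref(request), None)
```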
It is schedulers all the way down
A system call is already so far down the layers of abstraction that one might be forgiven for not knowing how it actually works, but where's the fun in that? So how does Linux put your thread to sleep efficiently and then wake it up when the time comes? Let's look at the system call again, this time with more focus on the arguments.
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, {tv_sec=30266, tv_nsec=649307312}, NULL) = 0
We put the thread to sleep for 5 seconds, yet we can't see 5 anywhere here. What are tv_sec and tv_nsec? They seem like some translation of our command to sleep for 5 seconds. Let's run it multiple times and see if anything changes.
λ strace -e trace=clock_nanosleep python sleep.py
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, {tv_sec=31402, tv_nsec=617991081}, NULL) = 0
λ strace -e trace=clock_nanosleep python sleep.py
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, {tv_sec=31411, tv_nsec=9811133}, NULL) = 0
λ strace -e trace=clock_nanosleep python sleep.py
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, {tv_sec=31416, tv_nsec=752812932}, NULL) = 0
tv_nsec seems random, but tv_sec seems to increase by roughly 5-ish seconds plus the random time I wasted between runs of the script. So maybe that is Linux epoch time? It seems similar to our mental model of sleep looping until datetime.now() + timedelta(seconds=seconds).
If you dig deeper, it turns out it is not epoch time. Linux has a continuously incrementing internal clock called "monotonic time", which is different from real-world time. This avoids all the complexities that come with real-world time, such as timezones and time synchronization with NTP.
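Python exposes this clock directly, so you can see it for yourself. The value below is seconds since an arbitrary fixed starting point (usually boot), and if you compare it with the tv_sec from strace, the deadline passed to clock_nanosleep is roughly this value plus the 5 seconds we asked to sleep:

```python
import time

# Monotonic time: always moves forward, immune to NTP adjustments,
# timezones and manual clock changes.
print(time.clock_gettime(time.CLOCK_MONOTONIC))
print(time.monotonic())  # portable wrapper around the same idea
```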
So here's roughly what is happening when we put our thread to sleep:
1. Create a Timer for a future time that is `current_monotonic_time + duration`.
2. De-schedule the thread so it doesn't consume CPU time.
3. When the Timer expires, the thread is put back into the run queue.
This still sounds like a bunch of abstractions to me, not so different from our mental model of time.sleep busy looping. Surely checking timer expiry is just yet another busy loop?
Timers, Interrupts and the Linux Process Scheduler
It turns out, Linux IS doing a busy loop, although in a much more efficient manner. Linux maintains a tick counter called jiffies, which is incremented 250 times a second by an onboard hardware clock using interrupts. On every increment it checks if there are pending expired timers and executes them. That feels very similar to how the Frappe scheduler works, at a very high level.
If I were to write it as pseudocode, here's a simplified implementation of this whole process:
HZ: Constant = 250
jiffies: Int = 0
timers: List[Timer] = []

# This is called by a hardware clock "HZ" times in a second.
def increment_jiffies():
    global jiffies
    jiffies += 1
    run_expired_timers()

def run_expired_timers():
    # Iterate over a copy, since we remove timers while looping.
    for timer in list(timers):
        if jiffies >= timer.expiry:
            timer.callback()
            timers.remove(timer)

def clock_nanosleep(seconds):
    thread = current()
    timer = Timer(expiry=jiffies + seconds * HZ, callback=lambda: thread.schedule())
    timers.append(timer)
    thread.deschedule()
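If you want to convince yourself the timer bookkeeping above behaves as described, here's a tiny, self-contained toy version (pure Python, no kernel involved) that drives the "hardware clock" by hand instead of relying on real interrupts:

```python
HZ = 250
jiffies = 0
timers = []  # (expiry_in_jiffies, callback) pairs

def increment_jiffies():
    global jiffies
    jiffies += 1
    for timer in list(timers):
        expiry, callback = timer
        if jiffies >= expiry:
            callback()
            timers.remove(timer)

# Register a "wake up in 2 seconds" timer, then simulate 3 seconds of ticks.
timers.append((jiffies + 2 * HZ, lambda: print(f"timer fired at jiffy {jiffies}")))
for _ in range(3 * HZ):
    increment_jiffies()  # prints: timer fired at jiffy 500
```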
So what now
This was a thinly veiled "why you should take an Operating Systems class" post. Often neglected, it is easily one of the most fundamental and interesting subjects that every programmer should at least have some idea about. You will learn about scheduling, isolation, virtualisation, performance, concurrency and, most importantly, the "magic" that makes it all work.
While it's not the same as understanding the process of inventing a universe, it comes pretty close for software engineers.
I recommend these books to start with: Operating Systems: Three Easy Pieces (free) for theory and Linux Kernel Development for practical implementations.