Problems with the ULE Scheduler

There are many CPU schedulers out there — ULE, 4BSD, CFS, and EEVDF, to name a few — but none of them are perfect. Every scheduler has areas where it excels and areas where it falls short. That's acceptable when the shortcomings stem from fundamental design trade-offs (e.g., the inherent limitations of a fair scheduler), but not when they stem from bugs or implementation oversights.

ULE, the default scheduler in FreeBSD, was introduced in FreeBSD 5.1 and became the default in FreeBSD 7.1. You might wonder why FreeBSD still ships its predecessor, 4BSD, even now at FreeBSD 15. The reason is that despite ULE being designed as a replacement for 4BSD, there are implementation issues in ULE that cause it to perform worse than 4BSD in certain scenarios. These should, of course, be fixed — and a developer has announced plans to work on this in the near future. In this post, I'll describe what those issues are and outline possible fixes.

Nice is not well-respected

In the ULE scheduler, nice values are not well-respected. The original ULE paper describes the design as follows:

A fixed part of the priority range in FreeBSD is used for time sharing threads. ULE uses the interactivity score to derive the priority. After this the nice value is added to the priority, which may actually lower the priority in the case of negative nice values.

For context, here is how FreeBSD defines its priority bands and how ULE calculates scheduling priority:

/*
 * Priorities range from 0 to 255.  Ranges are as follows:
 *
 * Interrupt threads:		0 - 7
 * Realtime user threads:	8 - 39
 * Top half kernel threads:	40 - 55
 * Time sharing user threads:	56 - 223
 * Idle user threads:		224 - 255
 *
 * ...
 */

sys/sys/priority.h

/*
 * If the score is interactive we place the thread in the realtime
 * queue with a priority that is less than kernel and interrupt
 * priorities.  These threads are not subject to nice restrictions.
 *
 * Scores greater than this are placed on the normal timeshare queue
 * where the priority is partially decided by the most recent cpu
 * utilization and the rest is decided by nice value.
 *
 * The nice value of the process has a linear effect on the calculated
 * score.  Negative nice values make it easier for a thread to be
 * considered interactive.
 */
score = imax(0, sched_interact_score(td) + nice);
if (score < sched_interact) {
    pri = PRI_MIN_INTERACT;
    pri += (PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) * score /
        sched_interact;
} else {
    const struct td_sched *const ts = td_get_sched(td);
    const u_int run = SCHED_TICK_RUN_SHIFTED(ts);
    const u_int len = SCHED_TICK_LENGTH(ts);
    const u_int nice_pri_off = SCHED_PRI_NICE(nice);
    const u_int cpu_pri_off = (((SCHED_PRI_CPU_RANGE - 1) *
        run + len / 2) / len + (1 << SCHED_TICK_SHIFT) / 2) >>
        SCHED_TICK_SHIFT;
    pri = PRI_MIN_BATCH + cpu_pri_off + nice_pri_off;
}

sys/kern/sched_ule.c:sched_priority(), simplified

In FreeBSD, nice values range from −20 to 20 (41 levels), while the timeshare priority band spans from 56 to 223 (168 levels). The nice value is added to the priority linearly, without scaling, so it can shift a thread's priority by at most 20 levels in either direction out of that 168-level band. The CPU-utilization term (cpu_pri_off above) can therefore easily dominate the calculation, effectively drowning out the nice value.

One straightforward fix would be to multiply the nice value by 3 or 4 before adding it to the priority. However, linear addition has a deeper problem: the perceptual difference between nice 0 and nice 1 is much greater than the difference between nice 18 and nice 19. A proportional approach using an exponential weight table solves this by expressing nice levels as ratios rather than fixed offsets — for example, each nice level corresponds to roughly a 10% change in CPU share.
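
To make the contrast concrete, here is a small, illustrative program (not ULE code; the constants and names are invented) that generates such a multiplicative table. Each nice level scales the weight by 1.25, i.e. roughly a 10% change in relative CPU share per level:

#include <stdio.h>
#include <math.h>

/*
 * Illustrative only: build a multiplicative nice-to-weight table in
 * which adjacent nice levels differ by a factor of 1.25.  The output
 * lands within a few percent of Linux's precomputed table shown below.
 */
#define NICE0_WEIGHT 1024.0

int
main(void)
{
    for (int nice = -20; nice <= 19; nice++)
        printf("nice %3d -> weight %6.0f\n",
            nice, NICE0_WEIGHT * pow(1.25, -nice));
    return (0);
}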

Linux already takes this approach. The CFS/EEVDF scheduler maps each nice level to a weight via a precomputed table:

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

kernel/sched/core.c

With this kind of weight table, nice values are meaningfully respected at every level, and each increment has a consistent relative effect across the entire range. For example, two CPU-bound tasks at nice 0 and nice 5 carry weights 1024 and 335, so they receive roughly 75% and 25% of the CPU; the same 3:1 split holds between any two tasks five nice levels apart.

ULE is not completely fair

The ULE scheduler is not completely fair. To illustrate what this means: Linux's CFS and EEVDF schedulers track each task's weighted CPU usage as a virtual runtime (vruntime) and always prefer the runnable task that has consumed the least. When a task yields the CPU before its time quantum expires, its vruntime stops advancing, so it is effectively compensated the next time it competes to be scheduled.
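
As a rough sketch of that accounting (simplified pseudo-C, not actual Linux code; the struct and function names are invented), vruntime advances in inverse proportion to a task's weight:

/*
 * Simplified sketch of weighted virtual-runtime accounting.
 */
#define NICE0_WEIGHT 1024

struct fair_task {
    unsigned long weight;           /* from the nice-to-weight table */
    unsigned long long vruntime;    /* weighted CPU time consumed */
};

static void
account_runtime(struct fair_task *t, unsigned long long delta_ns)
{
    /* Heavier (lower-nice) tasks accumulate vruntime more slowly. */
    t->vruntime += delta_ns * NICE0_WEIGHT / t->weight;
}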

ULE does not do this. When a batch-class task in ULE yields the CPU early, the remainder of its time quantum is simply lost. If the task repeatedly yields early — but not frequently enough to be promoted to the interactive class — it ends up receiving less CPU time than it is entitled to, leading to a form of soft starvation. One approach to fixing this would be to track the lost quantum as a credit and apply it the next time the task is scheduled.
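
A minimal sketch of that credit idea (hypothetical code; ULE's real td_sched fields differ, and every name here is invented) might look like this:

#define SLICE_CREDIT_MAX 10	/* cap on carried-over ticks */

struct batch_sched {
    int slice;          /* ticks left in the current quantum */
    int slice_credit;   /* unused ticks banked at the last yield */
};

/* On a voluntary yield, bank whatever is left of the quantum. */
static void
batch_yield(struct batch_sched *ts)
{
    ts->slice_credit += ts->slice;
    if (ts->slice_credit > SLICE_CREDIT_MAX)
        ts->slice_credit = SLICE_CREDIT_MAX;
    ts->slice = 0;
}

/* On the next dispatch, repay the banked ticks. */
static void
batch_assign_slice(struct batch_sched *ts, int base_slice)
{
    ts->slice = base_slice + ts->slice_credit;
    ts->slice_credit = 0;
}

Capping the credit keeps a thread that yields for long stretches from banking an arbitrarily large slice and then monopolizing the CPU when it wakes.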

Conclusion

Recently, FreeBSD landed a patch that allows ULE and 4BSD to coexist in the same kernel, enabling users to select a scheduler at boot time without recompiling. Once fixes for the issues described above land, users who have been relying on 4BSD due to ULE's shortcomings will be able to properly evaluate ULE's performance. If ULE can be shown to be superior to 4BSD in all cases, 4BSD can finally be deprecated. To follow progress on this work, subscribe to the freebsd-hackers mailing list.