ArsTechnica Explains O(1) Scheduler
geogeek writes "The recent release of Linux's 2.6 kernel introduces a number of sweeping improvements. These can be hard to understand without a background in programming. This week's Linux.ars examines the improved scheduler for an enthusiast audience, concisely explaining its nuances and the practical effects of an O(1) design."
A more detailed description of O() notation... (Score:4, Informative)
that article sucked (Score:1, Informative)
Re:Hmm. (Score:5, Informative)
The original article was MUCH MUCH more in depth; it was toned down because of the "glaze over" effect people were getting. I didn't have much say in the editorial process on the matter.
I think this way it at least appeals to a broader audience, not just the CS audience who would know what a binary min heap is if I were to use it.
To each his own I suppose.
Re:Multi core proc's (Score:5, Informative)
Well, only if you want to run only as many threads as you have CPUs. The windows machine I'm using right now has 37 processes, some I'm sure with multiple threads. I suppose you could have a hardware scheduler, but I don't think it would be a good idea. Scheduling doesn't take much time (certainly not this O(1) jobby), so you would lose a lot of flexibility without a lot of gain.
Interesting (Score:5, Informative)
Re:Hmm. (Score:5, Informative)
Re:Ars' Piece (Score:5, Informative)
There is one exception to this load balancing rule -- some special processes may be fixed to a certain runqueue. This attribute is called "thread affinity;" a thread is simply another name for a process.
I think it over-simplified this a bit. Granted, Linux doesn't make all that much distinction between a "thread" and a "process" as compared to (say) Windows, but the distinction is still important.
Otherwise, it's a decent article, and gave me some insight into one of the major improvements in 2.6 (which I've yet to try out).
I remember when the "pre-empt" patch came out for 2.4, before it was integrated into the kernel. For a desktop (or otherwise interactive, like my media box) system, it really made a major perceived improvement in performance (while *actual* performance wasn't affected; it was just about priorities and scheduling, making it "feel" more responsive).
The O(1) scheduling algorithm likely improves this further, and is enough to tempt me to spend XMas evening upgrading... these things are important, especially on a desktop system.
Windows XP, for example, isn't all that great in this area, and often one process can too easily slow things down. This is even further emphasized when all the windows you're working with are actually the main "explorer.exe" process; e.g., you do something in Windows Explorer that blocks (spin up a CD drive) and everything else (the "desktop", task bar, etc.) becomes unresponsive...
I enjoy seeing such improvements in the Linux kernel. The pre-empt patch (later integrated into the mainstream kernel) made a drastic usability improvement (especially when the box is also a web server, MySQL server, firewall/router, etc) and I couldn't see a Windows box handling similar tasks without becoming unusable (note: not unstable, just not practical to use when responsiveness is hindered).
I think the kernel team is on the right track with this, specifically for desktop applications (though it does help in many other applications as well; anything interactive, really, even over the network/Internet).
Re:Hmm. (Score:5, Informative)
Re:Hmm. (Score:5, Informative)
There are two queues used in the Ingo scheduler. One contains processes that are ready to run. The other contains processes that have already run. The run queue is scanned, and when a process is found that can be run, it is automatically run without scanning the rest of the queue. When the process finishes its time quantum, it is put into the second queue.
When the first queue (the active one) is empty, the queues are switched. This is just done by swapping the pointers to the queues.
This scheduler is O(1) because it picks the next process to run without scanning the entire queue, as an O(n) scheduler would have to do.
If you want a bit more information on it, search for the "Ingo scheduler" on Google or your other search engine of choice. I don't see any detailed explanations of what happens, but I suspect as the 2.6 kernel goes into wider use, those will come.
Ars got it slightly wrong about O(1)... (Score:5, Informative)
Re:Hmm. (Score:3, Informative)
Re:Deep algorithm analysis? (Score:1, Informative)
I seem to recall seeing benchmarks about O(1) in Kerneltrap, which confirmed that for a small (less than 50? I can't remember) number of processes, the old scheduler was better.
My main interest is the desktop/notebook/pda environments, though of course servers will benefit from this scheduler (due to their usually higher number of processes).
Even on the desktop, the O(1) scheduler could benefit those who load lots and lots of apps and keep them running. This is not my typical usage today, as I turn off the computer at night.
But this can change. If anyone is interested, there's a Linux ecology howto, which presents ideas on how to make computers less power hungry.
Re:Hmm. (Score:1, Informative)
I don't follow your logic here.
Assume that you have a queue with n processes. In the worst case when scanning through the queue, the first process that can be run is the last one; thus, it must check n processes. If the processes in the list are placed at random, then on average the scanner would have to check n/2 processes before finding the correct one = O(n).
How can this be called O(1)? I'm not saying the scheduler in 2.6.0 isn't O(1), but I believe your explanation is flawed.
UNIX catches up with mainframes (Score:5, Informative)
Mainframe schedulers have been O(1) for a long time. The one for UNIVAC 1108 EXEC 8 was O(1) in 1967. (And you can still buy a Unisys ClearPath server with that code in it.)
The traditional implementation is a set of FIFO queues, one for each priority. For systems with only a few priorities, that's just fine. As the number of different priorities increases, you spend too much time checking the empty queues, so there are tree-like structures to get around that problem.
Scheduling today is more complex because of cache issues. If you context-switch too much, you thrash in the CPU caches. On the other hand, everybody has so much memory today that paging is mostly unnecessary.
Re:Deep algorithm analysis? (Score:2, Informative)
To achieve O(1) you usually have to do some tricks that will get you quite a large constant.
I don't think that is necessarily true. You can (as they are doing here) trade memory use for speed by enumerating a bucket for every possible option and having a method to decide which bucket to look in. The memory requirements grow with the number of options, but the speed of the decision stays at a (low-constant) O(1). In general, it's a matter of building clever, info-rich lookup keys and organizing the data into a data structure that has the same access time for each key.
However, from http://67.163.90.104/~jim/scheduler.html
Basically, a function (find_busiest_queue) is called which will return another runqueue if there exists a runqueue that has 25%+ more runnable processes than the current runqueue. If there is such a runqueue, then the runqueues are locked (via each runqueue's spinlock), processes are transferred from the busier runqueue to the current runqueue, and then the runqueues are unlocked.
Since I think people lump the load-balancing portion of an SMP system in with the "scheduler", clearly the overall scheduling and running of processes is not O(1).
I haven't heard of/read about Sun/HP/IBM/SCO* speaking about their SMP system schedulers and runqueue balancers separately. Anyone? Anyone?
* Yeah, that is a joke, because SCO would never talk about their scheduler (scheduler headers files, OK, yeah, maybe in an open letter to rich CIO's) since they already have the Linux 2.6 O(1) scheduler in their secret underground stash of "scanned into TIFFs and copied onto CD's" source code - Ha!
Re:Hmm. (Score:5, Informative)
Your assumption is flawed, though. You seem to be assuming that only one process in the queue is runnable, and we can't assume that at any given time there's only one runnable process.
I believe you have a point about the worst case of the algorithm. That's not really relevant here, though. Consider the quicksort algorithm. It has a nasty worst case: data that's already sorted. As a matter of fact, the worst case is O(n^2) and approaches the slowness that is bubble sort. Despite this well-documented worst case, we say that quicksort is an O(n*log n) algorithm, which is the typical case.
I believe the worst case of the Ingo scheduler is O(n). If only one process can be run, it indeed has to check n/2 processes on average, and therefore is O(n).
You raise a good question, and figuring out what order an algorithm is, is something I'm not particularly good at. Generally the average case is what matters, but I don't know how you determine what the average case is for the state of processes.
Anyone care to try to explain this?
Re:Question (Score:5, Informative)
I think the majority of the major improvements in the 2.6 kernel over 2.4 were aimed at increasing system responsiveness.
(and yes, I'm using my Karma bonus)
Re:Ars' Piece (Score:1, Informative)
Did you ever try it? Ever heard of something called threads? We Windows people have had them for ages *eg*
Re:Preemption and disk requests - educate me (Score:3, Informative)
The application wasn't being used and so its memory was swapped to disk, now that memory is needed again so it needs to be brought back. But before it is brought back some space needs to be cleared which means writing some existing data from memory to disk (which is a slow process).
The solution is simply to have enough memory installed for your working set of applications (which may mean installing more memory, using less applications, or using more memory efficient programs).
Re:Preemption and disk requests - educate me (Score:5, Informative)
This usually happens when a process is in a loop, and repeatedly uses the same memory blocks, but can't keep them all in physical memory. Thus, it has to replace the pages every time it accesses/modifies them.
Even if a process is at a low priority... if it comes on the CPU, it has the potential to thrash memory and slow the works down.
If an application is thrashing, then you might be able to get away with scheduling it to not hit the disk. If the OS is thrashing, you are SOL.
Re:Question (Score:5, Informative)
Let's see if I can remember anything from my algorithms class.
How is this not O(n)?
Read on for an explanation. I do believe you gave the answer to your own question in your post. I will also assume that you understand "Big-O" notation well enough to follow the answer.
Each process still gets its time slice calculated; it is removed from one queue, and inserted into another.
Correct. You've nailed it. Insertion into a queue operates in O(1) time. Removal from the queue is also a constant-time operation. A queue inserts at the tail of the list (or the end of the array) and dequeues from the head of the list (or the beginning of the array). Queues do not permit searching, sorting, or accessing the data after the first element.
If we dequeue a process, run it, then calculate some information and queue it up again, that runs in c1*O(1) + c2*O(1) = O(1).
This is still an oversimplification and does not take into account the possibility of a priority queue, in which insertion and dequeuing each run in O(lg n). That is why a heap sort using a priority queue runs in O(n lg n).
Hope that helps.
Re:A question if anyone can answer (Score:2, Informative)
There are 140 possible priorities. Per processor. Each priority has its own pair of queues -- runnable-and-not-yet-run (active), and runnable-but-waiting-for-more-time (expired). (If a process isn't runnable, it doesn't even appear in the run queue; already an improvement over 2.4). In the 2.4 scheduler, the system task list WAS the runqueue list, and you had to inspect it, from beginning to end, holding a lock, to find the next best process to run. As the system got busier, guess what -- that search got longer and your holding of the lock had a bigger, cascading effect. With multiple cpus, you might finally have multiple processors decide on the same process ... then the first one grabs it and n-1 processors would have to go back and find the next best choice. Maybe they would conflict again. In 2.6, with 140 priorities often your queue of best processes is very short because you have so many places to put runnable processes, and on a per-processor basis no less.
But let's take a worst case example -- a heavily loaded system. You've got 1000 processes, but a lot of them are daemons waiting for something to happen. Still, as luck would have it, you have 200 processes runnable and waiting to run because your web server just got hit, and 4 processors handling them. Let's further assume that due to bad luck, all of them are stuck on only one of those cpus. And let's slap them with incredibly bad luck and place them ALL at the same priority, which is unlikely but certainly possible. Now that O(1) flies out the window, right? because you have 200 processes in one queue, just like before and now you have to look at all of them to figure out who runs next, right? Wrong. It's a FIFO linked list, and since you know they're all the same priority, you do the fair thing and take the one at the head. Took you a mere blink to determine the highest priority queue that had something (find first bit over a bitmap for the queues) and then a simple assignment to take the first thing off the list and select the fairest choice. As these processes use up their timeslices they are placed on the expired queue, again in FIFO order, and when the active is empty, with another simple assignment the expired becomes the active.
What's the down side? Well, the 2.4 scheduler had one queue (the tasklist) for all the cpus, and that meant it was self-balancing. You didn't end up with "200 processes on one processor"; as the processors needed processes they were handed one from The One Queue. However, in 2.6, a process doesn't randomly wander from its processor's queue. You do need to occasionally go move processes from processor to processor to insure they're not waiting too long for their turn. Otherwise, in the worst case, one processor has all of the runnable processes (all but one waiting) while the other processors sit idle.
So to summarize, under heavy load, 2.4's scheduler spent more and more time agonizing over which process to run, when you really needed it to be more and more efficient. 2.6 spends the same time under heavy and light load, which means more of the cpu is available to actually do the work. The effect is similar to removing three layers of middle management :)
Re:Hmm. (Score:3, Informative)
Therefore, if Ingo's scheduler really takes time proportional to the number of processes in the worst case, no matter how unlikely that case is, it is incorrect to say that it is O(1).
Re:Question (Score:2, Informative)
Depends on whether you're measuring the time to select another process to run, or the overhead necessary to do correct scheduling. Timeslice recalculation must happen, that's true. But on 2.4, it needed to happen in conjunction with other information gathered from the task list. There was one queue: the system task list. And to safely access that you need to take a read side of a read/write spinlock. Seems harmless enough, until you consider the side effect that a read lock will block a write lock from being acquired. So every time the scheduler has to inspect that list, it may be stopping some other part of the scheduler (possibly on another processor) from implementing some decision. In the worst case, you hold a read lock long enough that other processors' decisions become invalid. Once they acquire write permission they realize that the process has died or gone to sleep or is no longer a good choice for other reasons and then they must go back, grab a read lock (!) and search again. Thus blocking others who have reached decisions, etc.
2.6 (O(1)) significantly reduces lock contention. That alone accounts for tremendous savings under load. The cost is a little more memory and moderately more complexity. In this case, it is a very good trade.
Re:Ars' Piece (Score:4, Informative)
Uneducated guess... (Score:3, Informative)
...but wouldn't the fact that only one process is operated on at a time (i.e. a fixed amount of computing) as opposed to scanning the entire list of processes (could be one... could be thousands) each time make the difference between O(n) and O(1)? Sure, things will be slower if you have 10,000 things to check as opposed to one, but at least here we're not checking 10,000 things for each and every one in the list...
(IANAKH)
Re:Preemption and disk requests - educate me (Score:5, Informative)
In your scheme, the low priority process will never get a chance to run, because the high priority process is running all the time. So the high priority process keeps running forever doing nothing useful.
So if your email client was checking mail in the background, its connection will time out, file sharing will completely die, background printing fails due to communication time-outs, etc.
explorer.exe == bad design outside of the kernel (Score:5, Informative)
Unfortunately, even the best kernel process scheduler in the known universe would not fix this design flaw in Microsoft Windows, because what you're seeing is not a thread/process scheduling problem.
As you correctly observed, many Windows programs are forced to make "shell calls" -- which doesn't mean the same thing it does in Unix, where you fork a separate process.
The reason it bogs down when it's busy is because it is waiting on a single event queue. Mounting/unmounting of media, network lag, other processes sending/receiving messages, etc. all give EXPLORER.EXE more events to wait on. That's why it's a bottleneck, even when there's plenty of CPU to go around.
It's an awful design, and it's one of those things that's fundamentally broken about Windows.