Forgot your password?
typodupeerror
Linux Software

The Completely Fair Scheduler 292

Posted by CmdrTaco
from the i'm-always-late dept.
hichetu writes "Kernel trap has a nice summary of what is going on behind the scenes to change the Linux Scheduler. The O(1) Linux scheduler is going to be changed so that it is fair to interactive tasks. You will be surprised to know that O(1) is really too good not to have any side-effects on fairness to all tasks."
This discussion has been archived. No new comments can be posted.

The Completely Fair Scheduler

Comments Filter:
  • by Anonymous Coward on Sunday April 22, 2007 @11:33AM (#18831995)

    I thought Linux used Cron as a scheduler ?

    • by ozamosi (615254) on Sunday April 22, 2007 @11:37AM (#18832027) Homepage
      That is for scheduling background tasks that run once a day (or whatever you set it to)

      This is for scheduling CPU resouces in real time. To decide if Firefox or Apache is going to be executed the following split second.
      • by LiquidCoooled (634315) on Sunday April 22, 2007 @11:51AM (#18832103) Homepage Journal
        Can't we just give the processes weapons and let them decide which follows?
        • by sphealey (2855) on Sunday April 22, 2007 @12:02PM (#18832197)
          > Can't we just give the processes weapons and
          > let them decide which follows?

          That is actually the kind of question that my Operations Research professor (who also did a lot of work in CPU simulation and performance estimating) used to throw onto final exams as the "separate the B+ from the A" question. If your answer was interesting enough he would send you over to one of his Masters candidates to see if it could be taken any further. So I wouldn't count your suggestion out from the start!

          sPh
          • by mosel-saar-ruwer (732341) on Sunday April 22, 2007 @12:32PM (#18832419)

            GP: Can't we just give the processes weapons and let them decide which follows?

            P: That is actually the kind of question that my Operations Research professor (who also did a lot of work in CPU simulation and performance estimating) used to throw onto final exams as the "separate the B+ from the A" question. If your answer was interesting enough he would send you over to one of his Masters candidates to see if it could be taken any further. So I wouldn't count your suggestion out from the start!

            Behold: The Mother of All Possible Comp Sci Flame Wars: The Darwinistically Selected Genetic Algorithms -vs- the Intelligently Designed Algorithms.

            Bumper Stickers $4.95; T-Shirts $19.95:

            $DEITY does not play dice with the Turing Machine.
          • > Can't we just give the processes weapons and
            > let them decide which follows?

            That is actually the kind of question that my Operations Research professor (who also did a lot of work in CPU simulation and performance estimating) used to throw onto final exams...
            Operations Research being closely related to Industrial Engineering, your prof proves the old joke about how IE's are Imaginary Engineers.
          • by theonetruekeebler (60888) on Sunday April 22, 2007 @03:41PM (#18833671) Homepage Journal
            Okay -- Since I'm not allowed to drink beer in class I'll just have to post this from home.

            Want to give each process a weapon? Fine. But they have to earn ammunition.

            Every time a process gives up its slot, it's given a round of ammunition. It has the option of "shooting" a process ahead of it in the queue, thereby expending a round of ammunition. A shot process must give up its slot in the next round. Whether it loses all its ammo when it respawns remains a research question.

            There are two floating point tunable parameters, "accuracy" and "rampage." "Accuracy" is the likelihood that a given shot will actually hit the process it aims at. "Rampage" is the tendency of a process to save up rounds for a while then go on a spree.

            Okay, there's a third parameter, "armor," which is the odds of a hit actually becoming an injury. This is meant to protect system processes against luser jobs, and top-level processes against spawned threads.

            Of course, the scheduler itself is a boss job that can't be killed, has perfect armor and has infinite ammo.

            For the purpose of top and other job monitoring tools we can replace a process's "NICE" score with a "VIOLENCE" score -- an aggregate of their armor, accuracy, rampage tendencies and current ammo supply. We can rename the renice utility to medicate. The important thing about medication is that it eventually wears off, unless you specify the -l (lobotomize) option, which turns the process into a harmless drooling vegetable. Its companion utilities are aim and armor, which tune a job's accuracy and armor class, respectively.

            There are two important things about this approach. First, it's probabilistic instead of purely heirarchical. Second, it should give Jack Thompson the screaming heebie jeebies. In fact, I'm going to call this the JTMS scheduler -- the Jack Thompson Murder Simulator Scheduler.

            I'm sure this concept can be explored further, but the bar's about to close.

            • by kcbrown (7426) <slashdot@sysexperts.com> on Sunday April 22, 2007 @07:32PM (#18835337)

              For the purpose of top and other job monitoring tools we can replace a process's "NICE" score with a "VIOLENCE" score -- an aggregate of their armor, accuracy, rampage tendencies and current ammo supply. We can rename the renice utility to medicate. The important thing about medication is that it eventually wears off, unless you specify the -l (lobotomize) option, which turns the process into a harmless drooling vegetable. Its companion utilities are aim and armor, which tune a job's accuracy and armor class, respectively.

              Of course, with such a scheduler, something like the Doom system administration tool [unm.edu] (perhaps more like Quake where you can aim vertically as well as horizontally) will become the preferred method of managing the processes on a system.

              For one thing, the processes will obviously shoot back, as the process manager itself (which you see as yourself when running it) is a running process, and thus subject to being fired upon by the other processes.

              Secondly, a headshot obviously gets you a "lobotomize" effect. This could pose a problem if one of the other processes hits you with a headshot...

              Finally, the application of a medpack to an injured process invokes the "medicate" action.

              There are a few possible problems with this, of course:

              1. When you have two or more system administrators, all running the process manager, the system itself becomes a warzone with innocent processes being killed by the dozen as the administrators go on rampages in their attempts to kill each other for supreme control of the system.
              2. Certain weapons, such as the BFG, are powerful enough to take out all but the most heavily armored processes, and since some of them are area effect weapons, a lot of innocent processes will bite the dust as a result of their usage.
              3. Lightly-armored processes will need additional protection in the form of fast reflexes to avoid being hit.
              4. Eventually the administrators will begin using aimbots and the like. One can see where the resulting arms race will go. Obviously the aimbots will have to run on a different system since otherwise they'll be potential targets.
              5. "Spawn camping" takes on a whole new meaning. Newly created processes become very vulnerable compared with running under earlier versions of Linux. Normal users will have an increasingly difficult time starting tasks like OpenOffice and will start to migrate back to Windows or other OSes with clearly inferior schedulers.
              6. Due to all of the above, the system will eventually become unusable by anyone but the system administrators. The sysadmins will, of course, say that this is how it should be.

              In short, Linux will quickly become the must-have operating system for gamers, but at the expense of the general purpose desktop.

            • Re: (Score:3, Funny)

              by Syberghost (10557)
              Of course, the scheduler itself is a boss job that can't be killed, has perfect armor and has infinite ammo.

              Fascist!
          • It's all data structures and interrupts nowdays laddie! Data structures and interrupts. Why back in my day we had nice chrome-plated algorithms and fat polling intervals and liked it!

            You pesky young folk who think yer flash ram is better than good old reliable ferrite donuts have at with ye! We had disk drives that could pull yer filings out from two feet away on a max seek. Do ya get technology like that nowdays? Do ya? Nuuuuu.....

            Mind ye what we had for an OS scheduler was a smiling polish gennelm

        • by HerrEkberg (971000) on Sunday April 22, 2007 @12:05PM (#18832227) Homepage
          Just throw this [unm.edu] into the kernel and we are good to go.
          • Good job sending all those /.ers over there, it will be the mother of all process fights on that poor server now. Sysadmins battling their way through hordes of zombies and monster processes, with ammo (ehm.. mem,cpu) running lower and lower until they're out, just as another wave of uglies comes out of nowhere...

        • Exokernels hand over a lot of the scheduling to the processes themselves.
        • by Progman3K (515744) on Sunday April 22, 2007 @12:41PM (#18832489)
          Yes, that's called psDoom
          http://psdoom.sourceforge.net/ [sourceforge.net]
        • by Lars T. (470328)
          So the first scheduled process kills -9 all others - makes scheduling and multi-tasking kinda pointless.
  • Fair? (Score:5, Funny)

    by alienmole (15522) on Sunday April 22, 2007 @11:37AM (#18832029)
    If scheduling was completely fair, this would have been a frist ps0t.
  • by 0xdeadbeef (28836) on Sunday April 22, 2007 @11:40AM (#18832039) Homepage Journal
    Free software: because some processes are more equal than others.
    • by Anonymous Coward on Sunday April 22, 2007 @11:56AM (#18832135)
      Yes, each process according to its needs,
      each CPU according to its abilities.
      • by Kjella (173770)
        And for the capitalist owning the CPUs and giving the processes work: Outlook not so good. Fortunately, our slave laborers are very loyal.
  • Surprised? (Score:5, Funny)

    by WombatDeath (681651) on Sunday April 22, 2007 @11:43AM (#18832055)
    You will be surprised to know that O(1) is really too good not to have any side-effects on fairness to all tasks.

    No I won't, because I don't know what the hell it means.

    Hah! In your face, Taco!
    • Re:Surprised? (Score:5, Informative)

      by Eevee (535658) on Sunday April 22, 2007 @11:50AM (#18832101)
      In computer science, Big O notation [wikipedia.org] is used for the complexity of a task.
      • by BiggerIsBetter (682164) on Sunday April 22, 2007 @12:21PM (#18832323)
        In computer science, Big O notation is used for the complexity of a task.

        Meanwhile, outside computer science, Big O faces are used for the completion of a task.
    • Re:Surprised? (Score:5, Informative)

      by jrschulz (684749) on Sunday April 22, 2007 @12:06PM (#18832235) Homepage
      A scheduler is the piece of software that brings you the illusion of multi-tasking. Because a single processor (with a single core) can only run one process at the same time, the operating system switches the process currently running. And it does this very fast (IIRC up to 1000 times a second in the case of linux).

      The scheduler decides which process runs when and has to make sure that no process has to wait in the queue forever without getting his share of CPU time (this is what is called "starving").

      Since the scheduler is a program by itself, it has a specific runtime characteristic, usually dependent of the number of programs waiting for their CPU share. The special property of the current scheduler in linux is that its runtime is in fact independent of this number. That's expressed in CS by O(1).
  • by Anonymous Coward
    Will the new scheduler be more efficient at scheduling single processes across a multi-core system?

    On another not[e], when are we going to be able to build a default kernel capable of low latency I/O for A/V work?
  • I/O prioritisation (Score:5, Interesting)

    by Doug Neal (195160) on Sunday April 22, 2007 @11:47AM (#18832069)
    Linux really doesn't need a new process scheduler. What it could really do with is I/O prioritisation. Windows now has it, so there's no excuse. CPU power is fairly abundant these days so managing its usage is less of an issue than it used to be, but I/O bandwidth is often in short supply and I/O-bound applications can choke a system and make interactive processes a pain in the ass to use. I'd like to see some way of reserving and limiting bandwidth to particular devices for particular processes. And an equivalent of "top" for monitoring processes' I/O activity would also be extremely handy... as far as I know, the system calls don't even exist in the kernel to do this yet.
    • by tepples (727027) <.moc.liamg. .ta. .selppet.> on Sunday April 22, 2007 @11:54AM (#18832123) Homepage Journal

      And an equivalent of "top" for monitoring processes' I/O activity would also be extremely handy... as far as I know, the system calls don't even exist in the kernel to do this yet.
      Windows has this to an extent. I press Ctrl+Shift+Esc to open Windows Task Manager, then View > Select Columns... > turn on Page Faults Delta, and I see the penalty for sticking with the 6-year-old paid-for PC that I still use.
      • by jawtheshark (198669) * <slashdot&jawtheshark,com> on Sunday April 22, 2007 @12:31PM (#18832405) Homepage Journal

        Now, page faults are indeed a form of I/O, but a page fault is technically seen just the fact that some memory required isn't in physical memory. I don't think the parent poster was talking about that. One of the most common reasons for page faults are simply that a block of memory has been swapped to disk, and then suddenly it is required, and as such the block of memory needs to be read into the physical memory.

        I'd say: add some memory to that box of yours.

        You can read up on it here [wikipedia.org]

        • I'd say: add some memory to that box of yours.

          That assumes that the loader is going to try to bring the whole program into memory at load time, which is almost never the case. It completely ignores the notion of working sets [wikipedia.org], or how they change over the lifetime of a process. As an example, why should a program keep its "open all files" code in memory once all the files are opened? Vomit Making System loaded the first page of a program (the entry point) and forced the program to demand in other (512 byte

    • by Anonymous Coward on Sunday April 22, 2007 @11:57AM (#18832149)
      > What it could really do with is I/O prioritisation.
      ionice [die.net]. Available since 2005-08-28.

      > And an equivalent of "top" for monitoring processes' I/O activity would also be extremely handy

      I agree, that would be nice.
      • by ppc_digger (961188) on Sunday April 22, 2007 @12:37PM (#18832445)

        I agree, that would be nice.

        Don't you mean "that would be ionice"?
      • by Kjella (173770)
        I see ionice isn't installed by default on Debian etch, but found it in the schedutils package. However, I still don't see how to use it in a useful way. For example, I would like to load kplayer with high priority on both CPU and IO, every time I run it (i.e. every time I start the process). I understand I can dig out the process ID, open up a terminal, renice and ionice it, but is there any way to make it run that way by default?
    • Re: (Score:3, Informative)

      by hackstraw (262471)
      Linux really doesn't need a new process scheduler. What it could really do with is I/O prioritisation. Windows now has it, so there's no excuse.

      I dunno. Linux has has some changes recently in the scheduling department, and an O(1) process scheduler can only be "a good thing". Recently, the I/O block layer got a new scheduler (linky http://kerneltrap.org/node/7637 [kerneltrap.org]). Regarding other I/O prioritization, I can't say with authority that this is needed or not.

      Maybe all of these things are related, but in my se
      • Re: (Score:3, Informative)

        by makomk (752139)
        Linux has has some changes recently in the scheduling department, and an O(1) process scheduler can only be "a good thing".

        Which, presumably, is why they're thinking of taking out its current O(1) process scheduler and replacing it with an O(log(n)) one?
      • Re: (Score:3, Interesting)

        by asninn (1071320)
        There's some folks working on scaling [lwn.net], too. ;) I wonder where I'll get a 4096 CPU machine, though...
    • by JohnFluxx (413620)
      I'm the author of the kde task manager thing. It would be interesting to add a view for I/O. If you find out how to do it code wise, let me know :-)
    • by Stephen Williams (23750) on Sunday April 22, 2007 @12:28PM (#18832379) Journal
      And an equivalent of "top" for monitoring processes' I/O activity would also be extremely handy

      I'd love something like that.

      There's a way of logging I/O; it's pretty rough-and-ready, not really suitable for permanent use, but can be handy for figuring out what keeps causing a laptop HDD to spin up, for example. As root, do:

      echo 1 >/proc/sys/vm/block_dump
      I/O is then logged to the kernel ring buffer, and can be retrieved with dmesg. The entries look like:

      pdflush(138): WRITE block 1161864 on dm-4
      pdflush(138): WRITE block 0 on dm-3
      pdflush(138): WRITE block 524328 on dm-3
      pdflush(138): WRITE block 786952 on dm-3
      pdflush(138): WRITE block 786960 on dm-3
      When you've finished, do

      echo 0 >/proc/sys/vm/block_dump
      as root to turn it off again.

      Like I said, very rough-and-ready, nowhere near as nice as a proper I/O top would be, but there it is.

      -Stephen
      • Re: (Score:2, Informative)

        by Mike McTernan (260224)
        iostat is probably an easier way to achieve similar output.

        It still doesn't break it down by task though, so not really the iotop people are looking for.
      • by Handyman (97520) on Sunday April 22, 2007 @04:58PM (#18834277) Homepage Journal
        In fact, this functionality is pretty darn useless. (I should know, I got it into 2.6. :-) ) For read I/O it's okay, but for write I/O it sucks. The thing is, what's logged is the actual I/Os, which are generally done by pdflush (if the modifying process doesn't use fsync, which it usually doesn't), and kjournald if you're using ext3. The only thing that's logged otherwise is that a process dirties an inode, but that doesn't tell you how many pages it's modified. You get something like

        process syslogd dirtied inode daemon.log
        process Y dirtied inode some_other_file1
        process Z dirtied inode some_other_file2

        followed by 300 writes by pdflush, which only specify a device and a block number, not a file name. There's no way you can find out for each of the 300 writes whether it was caused by the "syslogd", X or Z process. So there's no way you can count the amount of write I/O that a process has done.
    • Re: (Score:2, Informative)

      by Anonymous Coward
      Linux has had I/O prioritization for a few years now. There are a couple of I/O schedulers adapted to different workloads (deadline, anticipatory, fair, no op) that can be set globally or per block device. The default scheduler since 2.6.18 is cfq (the "fair" one) and it supports manual prioritization too (using the "ionice" command).

      In fact, I don't think Linux has ever suffered from the same I/O problems that people often complain about in Windows.
    • by jhines (82154)
      If i remember correctly VMS used to give a job a temporary priority boost, upon completion of the IO for just this reason, to get it to the top of the cpu queue and get it going again.

    • by DaleGlass (1068434) on Sunday April 22, 2007 @12:31PM (#18832409) Homepage
      atop seems to have support for disk monitoring, but it requires a kernel patch:

      http://www.atconsultancy.nl/atop/kernpatch.html [atconsultancy.nl]
    • Re: (Score:2, Interesting)

      by Anonymous Coward

      I read an article somewhere on Sun's site about I/O scheduling in SunOS, but unfortunately I can't find it now. I'll try to summarize it. The basic point was that when a task wants to write data, it can typically hand that data off to the kernel and then keep running. That means that it's making best use of system resources, since it will use the processor (for example) while the kernel writes to the disk in the background. When a task wants to read data, however, it usually can't do anything until the

    • by Animats (122034) on Sunday April 22, 2007 @01:24PM (#18832747) Homepage

      Linux really doesn't need a new process scheduler. What it could really do with is I/O prioritisation.

      QNX has that, which is essential for real-time work.

      QNX has the advantage that I/O, like almost everything else in QNX, is done via inter-process message passing operations. The message passing system uses priority queues, and so requests to file systems and devices get handled in priority order. So resource managers (file systems, device drivers, etc.) don't have to explicitly handle priorities; it's done for them. Some resource managers, like disk handlers, process multiple requests at a time so they can reorder them to optimize access, but network devices and such are FIFO at the resource manager level and priority ordered at the message level.

      The end result is that you can compile or surf the web on a system that's also doing real time work without interfering with the real time tasks.

  • by phrasebook (740834) on Sunday April 22, 2007 @11:49AM (#18832085)
    This new scheduler may have 'Ingo Molnar' written all over it but I'll be giving the credit to ck!
    • CFS vs SD isn't decided yet. Con is still releasing new versions as the people testing it find new bugs. Not to mention that SD is in the mm- tree and CFS isn't (yet)
    • by jb.cancer (905806) on Sunday April 22, 2007 @01:20PM (#18832723)
      Well, yes & no. It really was Con's RSDL that got ppl looking seriously into changing the mainline scheduler (there already being several out-of-mainline alternatives like nicksched, etc.)

      Con's scheduler seemed to work better at higher workloads than the mainline, by just trying to distribute load evenly and not trying pretty interactivity tricks. But several ppl did say it didn't perform well for certain X client workloads. That's when Ingo's CFS was posted.

      There really is 2 alternatives Ingo's CFS & Con's SDL that's being simultaneously tested by the kernel developers now, and none is accepted into mainline.

      So it wouldn't be fair to say that CFS is *the* next Linux scheduler. It could be SDL as well.
  • by suv4x4 (956391) on Sunday April 22, 2007 @11:51AM (#18832107)
    It's a very interesting phenomenon.

    After enough number of iterations trying to optimize a software program to do everything very well compared to the base "naive" solution, you end up with an OS that does everything poorly.

    It's counterintuitive, but we see it it every day around us.
    • What are you smoking? This is supposed to be an improvement on the O(1) scheduler which was an improvement on the previous scheduler.

      I don't know about you but my 2.6 based box runs just fine, certainly no slower than my previous 2.4 boxes, and it packs a lot more features, drivers and other internal changes/improvements.

    • Re: (Score:3, Insightful)

      by Chandon Seldon (43083)

      After enough number of iterations trying to optimize a software program to do everything very well compared to the base "naive" solution, you end up with an OS that does everything poorly.

      That's blatantly false. Sure, there are tradeoffs. There are also cases where a better algorithm is an outright win 100% of the time.

  • I would love to know how the algorithm actually works. Is this just a new data structure for organizing the processes, or is there other stuff involved?
  • by Anonymous Coward on Sunday April 22, 2007 @12:03PM (#18832211)
    On Multics, tasks that used up their CPU were put in a lower priority run queue and tasks that didn't use up their CPU time were put into a higher priority run queue.

    So interactive tasks naturally floated to the top and compute bound tasks naturally sank.

    (Anyone remember Multics?)
  • FTA:
    >> [...] the core scheduler got smaller by more than 700 lines:

    Someone who I respect tremendously once said
    "Perfection is attained when you remove everything that is not essential"

    I have to agree
    • by Paul Crowley (837)
      Antoine de Saint-Exupéry:

      Il semble que la perfection soit atteinte non quand il n'y a plus rien à ajouter, mais quand il n'y a plus rien à retrancher.

      Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
  • by tji (74570) on Sunday April 22, 2007 @12:36PM (#18832437)
    No, I think I'll wait for the unbelievably fair scheduler, or perhaps the ridiculously fair scheduler.
  • Interactive tasks (Score:4, Insightful)

    by Zarhan (415465) on Sunday April 22, 2007 @12:56PM (#18832601)
    For this reason, I've been using Con Kolivas' patches to replace the scheduler. http://members.optusnet.com.au/ckolivas/kernel/ [optusnet.com.au] - very helpful especially if you don't have the fastest computer around. Also seems to help a bit with I/O - if my hard drive is trashing for whatever reason, interactive stuff still remains reasonably responsive. Or at least it doesn't make my mouse cursor skip...

    Even so, I'd prefer to have IO better scheduled - ionice doesn't really seem to work at least for me.
  • by Anonymous Coward
    There's a huge thread about broken disk IO in kernels > 2.6.17.
    http://forums.gentoo.org/viewtopic-t-482731.html [gentoo.org]

    How about fixing that first?

    I've noticed all kinds of new schedulers coming out of the woodwork in 2.6. Optional kernel preemption. And also changes to the default timer (250 instead of 100, etc.). These are choices like I've never had in Linux before. Should be good, right?

    But at a higher level there's been such brokenness.
    K3B CD burning broke in 2.6.8 and wasn't fixed until 2.6.10.
    Firewire
  • You will be surprised to know that O(1) is really too good not to have any side-effects on fairness to all tasks.
    No I won't. AFAIK, O(1) is generally too good for any algorithm that does something useful, given the size of your input varies (n processes). With an input of size n, O(n^k) is generally (i.e. not considering a specific algorithm) acceptable, O(n) is generally good, O(log n) is generally very good and O(1) is generally suppa-duppa doubleplusgood.
  • by hcdejong (561314) <hobbes@NOSPAM.xmsnet.nl> on Sunday April 22, 2007 @02:12PM (#18833055)
    Klingon multitasking systems do not support "time-sharing". When a Klingon program wants to run, it challenges the scheduler in hand-to-hand combat and owns the machine.

    (from here [sdf-eu.org])
  • by Animats (122034) on Sunday April 22, 2007 @03:18PM (#18833499) Homepage

    I'm impressed. I did some work on CPU dispatching on a mainframe system in the distant past, and we were never able to beat O(log n) on an ordered dispatch queue. The obvious implementation is O(n); getting to O(log n) requires a tree, and I want to see how they got to O(1).

    This stuff really matters now. If we're going to have 80-CPU multiprocessors, all the main operating system algorithms have to be near O(1), or the scale-up is a lose.

    • Re: (Score:3, Informative)

      by Anonymous Coward
      They cheated.

      Internally, the kernel only has five priority levels. Each one is a queue, and it compares among all of them to determine which task to run, but it only compares the head of the queue. So it's O(5), which is of course O(1), but if it supported an arbitrary number of priority levels (which IMO it should) then it would become O(n) again.
    • Re: (Score:3, Informative)

      by Watson Ladd (955755)
      They cheated. Instead of having a real priority queue Linux places each process into one of a bunch of smaller queues each one with a different priority. Then scheduling requires looking a the processes that are at the head of the fixed number of small queues.

Chemist who falls in acid is absorbed in work.

Working...