Linux Software

Linux 2.2 and 2.4 VM Systems Compared 225

Derek Glidden writes "I got sick of trying to figure out from other people's reports whether or not the 2.4 kernel VM system was broken or not, so I decided to run my own tests, write them up and post them online. The short conclusion is that the 2.4 VM rocks when compared with 2.2, but there's more to it than just that."
This discussion has been archived. No new comments can be posted.

  • Better but bad (Score:2, Insightful)

    by Chocky2 ( 99588 )
    2.4 VM is, IMO, a significant improvement over the 2.2 VM, but completely rewriting something as important as VM management is intrinsically risky and it's difficult to predict with even the slightest confidence many of the consequences of such a change. This sort of thing should be left for major revisions.
  • BSoD (Score:3, Insightful)

    by parc ( 25467 ) on Friday November 02, 2001 @03:05PM (#2513300)
    From the article:

    These were pretty uninteresting - just sitting there watching the kernel compile. Except that at one point, while running the 2.4.13 kernel, the hard drive started grinding away with the drive light pegged on continuously, the display became extremely sluggish and quickly froze up entirely, and about ten minutes later, the hard drive light went off but the machine remained unresponsive, requiring a hard reboot. I don't think this was related to anything I was doing as I wasn't actually doing a compile run at the time - probably just a random occurance, but worth mentioning.

    So the machine essentially BSoD'd, but it's not interesting?

    • Re:BSoD (Score:1, Informative)

      by Anonymous Coward
      BSODs happen right away. They don't put you through that kind of torture. Also, they give you a nice stack trace for informational purposes.

      If that had been a coredump, it probably could have been helpful, but since they don't even have a kernel debugger for Linux yet, these kinds of occurrences are just brushed away as "random occurance"s.
      • You would be surprised what you can do with the call trace of an oops and a list of kernel symbols.
        No need for a debugger in the kernel.
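
        To illustrate the point about call traces: turning a raw address from an oops into "symbol+offset" is essentially a lookup in a sorted symbol table. A minimal sketch, assuming a System.map-style listing of "address type name" lines; the addresses below are made up for illustration and the helper names are hypothetical.

```python
import bisect

# Hypothetical excerpt in System.map format: "<hex address> <type> <name>".
# The addresses are invented for this example.
SYSTEM_MAP = """\
c0100000 T _stext
c0105000 T do_page_fault
c0105400 T handle_mm_fault
c0106000 T try_to_free_pages
"""

def load_symbols(text):
    """Parse the listing into a list of (address, name) sorted by address."""
    symbols = []
    for line in text.splitlines():
        addr, _type, name = line.split()
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols

def resolve(symbols, addr):
    """Return 'name+0xoffset' for the last symbol at or below addr."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[i]
    return f"{name}+{addr - base:#x}"

symbols = load_symbols(SYSTEM_MAP)
print(resolve(symbols, 0xC0105432))
```

        Applying this to each address in a call trace gives a readable backtrace without any debugger running in the kernel.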

    • by matty ( 3385 ) on Friday November 02, 2001 @03:41PM (#2513469) Homepage
      ...is that Linux's warts are fully out in the open for all to see. Microsoft would never admit to such failings openly, even though anyone who has used Windows extensively is painfully aware of them.

      And it's been my experience that you don't hear, "Linux never crashes" that much anymore. At least I don't say it anymore, whereas I used to. I would still say that a properly configured Linux box is more stable than any Windows box, but I've had my share of lockups. (on the desktop anyway. You'll notice my server has been up for 140+ days. The last reboot was when the power supply died [it's a patched together P166] which interrupted 243 days uptime)

      All the mailing lists are public, and all of Linux's problems are there for anyone to see. This allows people to make truly informed decisions about which version of Linux to use, or whether to even use it at all. (Yes, of course these things are also true of *BSD) The current issues are why I still run 2.2.19 on my servers, since none of them get anywhere near enough load to need the newer VM's. "Stable" is definitely a relative term.

    • Re:BSoD (Score:2, Informative)

      by sbrown123 ( 229895 )
      Actually, it's probably something very simple: EnergySaver. The computer went into sleep mode, which I have seen lock Linux up before.
    • Please mod this down, the guy is stupid. The article says:

      These were pretty uninteresting [...]Except that at one point[...]

      That means that the guy says that the BSoD (as you like to call it) was interesting.
    • by Puk ( 80503 )
      Um... what?

      These were pretty uninteresting - just sitting there watching the kernel compile. Except that at one point, ...

      So the machine essentially BSoD'd, but it's not interesting?

      It seems to me he said that they were uninteresting, except when it BSODed -- which was interesting.

    • Windows users don't find BSOD's interesting. They happen all day long. Why should we? :)

      Seriously, even I have seen Linux die.. it would have been interesting if it kept happening, but it clearly states it was a one-time event.

    • Worse is the information that when he loaded his system up, he got repeated "parse error" messages from GCC. That's unacceptable, and needs to be understood. He suggests that it could be a transient hardware error, but since he got the same error more than once, that's unlikely. The other possibilities are that the VM manager corrupts memory under overload (very bad) or that GCC ignores some reported out-of-memory condition (also very bad.) This needs to be diagnosed.
  • by b-side.org ( 533194 ) <bside.b-side@org> on Friday November 02, 2001 @03:06PM (#2513304) Homepage
    it goes like this -

    the 2.2 VM 'feels' better for single-user use (which i disagree with) but falls down under 'heavy' load. (which, as i've pushed 2.2 to load avgs above 250, i also disagree with)

    but, anyways, that's what he's saying. i found 2.4 to be much nicer in the one userland task that frequently shows off the VM - mp3 decoding under load. 2.4 never, ever skips, 2.2, with or without ESD, skipped frequently.

    • YMMVGV (Score:2, Interesting)

      by Tailhook ( 98486 )
      Your observations run counter to the continuous stream of reports of high latency in 2.4 found in the linux kernel mailing list. Specifically, skipping mp3 playing is the canonical report. 2.2 is often cited as being the less latent of the two.

      I don't claim you're wrong. I point this out only to illustrate the subjectivity and lack of real data involved in these anecdotal reports. At the very least, the author has attempted to produce hard data on the matter.

      Linus obviously thought poorly enough of the original 2.4 VM to scrap the mess.

    • VM load and system load are two very different things. You can have 250 processes blocked on a floppy disk read and run your load to 250, but try having a bunch of processes and the kernel compete for the last block of memory, especially networked apps where your network card driver all of a sudden needs contiguous blocks of memory in a heavily fragmented system and watch the difference between 2.2 and 2.4.
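
      The contiguous-memory point can be shown with a toy free-frame map: the total amount of free memory can be identical while a multi-frame request succeeds in one layout and fails in the other. This is only a sketch with a hypothetical helper, not how the kernel's allocator is written.

```python
def find_contiguous(free, count):
    """Return the start index of the first run of `count` free frames, or None."""
    run_start, run_len = 0, 0
    for i, is_free in enumerate(free):
        if is_free:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == count:
                return run_start
        else:
            run_len = 0
    return None

# Toy memory maps: True marks a free frame. Both have 8 free frames,
# but the fragmented one never has two free frames in a row.
fragmented = [True, False] * 8
compact = [False] * 8 + [True] * 8

print(sum(fragmented), find_contiguous(fragmented, 4))
print(sum(compact), find_contiguous(compact, 4))
```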

      One more thing to note is the VM != the scheduler. The scheduler is what hands out CPU time slices to programs and ensures your mp3 decoder doesn't skip if it's been using a lot of CPU for some period of time. The VM is what manages memory allocations and decides what to page out and page in to and from disk.

      Really, there should be very little difference between VMs unless you are in a low memory condition. Now there is some difference when you consider cached disk pages, but if you are just running a mp3 decoder, I don't think you are constantly re-executing it over and over and even if you were, as long as you aren't in a low memory situation, both VMs should do basically the same thing.
  • by Azog ( 20907 ) on Friday November 02, 2001 @03:08PM (#2513313) Homepage
    If anyone out there has been having problems with 2.4 vm's (and there have been some problems) you should give 2.4.14-pre7 a try. Things have been moving fast on this front for a while now, but Linus thinks it's pretty much there now.

    In his words, "In fact, I'd _really_ like to know of any VM loads that show bad behaviour. If you have a pet peeve about the VM, now is the time to speak up. Because otherwise I think I'm done."

    This is an experimental patch to 2.4.13, and you shouldn't run it on an important machine, but the VM by all accounts is much improved.

    Even Alan Cox (who has been maintaining the older Rik Van Riel version of the VM in his -ac patches) agrees that the new VM is faster and simpler, and he plans to switch to it as soon as it is reliable enough to pass his stress testing. (which should be really soon, it seems.)

    (Yes, I spend an hour a day reading the kernel mailing list.)
  • Anyone have a link to the problems alluded to in the article with early 2.4 kernels? I've got a machine at work that swaps quite a lot and is running 2.4.2 ... I know it's due for an upgrade, but I'd like to know a little more specifically what was wrong back then...Thanks.
  • ... about running 2.2.19 (on RedHat 6.2) on my dual PII server. Guess it's about time to upgrade, though. RedHat 7.2 is looking nice.

    How is bigmem support coming along? Is 2.4 still having problems with (32-bit) systems sporting more than 2 GB of ram?
    • How is bigmem support coming along? Is 2.4 still having problems with (32-bit) systems sporting more than 2GB of ram?

      I don't believe 32bit machines can address more than 2gb of ram, that's one of the reasons there are 64bit machines. At work we have an SGI Octane2 that does HDTV work, it has 6gb installed and can support 8gb. It's a 64bit machine and runs IRIX64 v.6.5.13, a 64bit OS.
    • Just a note...

      RedHat 7.2 ships with a modified linux 2.4.7,
      so you will not get the AVM (Andrea's VM).

      Also RH7.2 kernel still has the local root ptrace vulnerability, so you will need to upgrade to kernel-2.4.9 right away. RPMS are available at the usual places.

      Then you can feel good...
  • by rho ( 6063 ) on Friday November 02, 2001 @03:13PM (#2513337) Homepage Journal

    Quite often I get the feeling that Linux and BSD are doing quite a bit of "me-too"-isms in an attempt to catch up with the mainstream OSes--including MS, Apple and commercial Unixen.

    I read this story and wonder if I should still be getting the same feeling -- isn't a VM subsystem mostly a solved problem? Or am I reading this wrong, and this is merely tweaking and specialization?

    Since I'm no Alan Cox (I'm closer to Alan Thicke), I can't see the truth of the matter, but I get the feeling that we're doing a lot of walking in a tight circle on the path, while others have already left the forest.

    • It's hard to say if the VM subsystem has been completely reworked in the MS operating systems. It's all closed source. But I think it's a fair guess that NT and 9x had completely different VM subsystems, and in addition, that the Win2k VM subsystem is likely a complete rework of the NT 4 VM. I guess it's something that happens from time to time... someone thinks of a different way of doing things that changes how everything works together, and it makes something faster, something slower.
      • That makes sense, but it still sounds to me that the bickering between Rik and Andrea's VM is more fundamental.

        (which means little, since I only understand one word in three in a technical comparison between the two)

          • Well, things changed enough between the 2.2 series kernels and the 2.4 series kernels that a change in the way of thinking MIGHT make things better. Someone came up with a different way of doing things... The debate over which VM is better could be compared to the Vi vs. Emacs debate. The "which is better" debate isn't always an objective thing. You can't simply say Emacs is better or Vi is better. The question becomes "what direction do we want to take the kernel." Not only that, but Linus was butting in a bit on Alan's territory, which added some heat to the debate. In the middle of a stable tree isn't necessarily the best place to add a completely new VM subsystem. Linus thought it was important enough to add now. Linus'll do anything to squeeze out a few more points on the Infinite Loops Per Second benchmark.
        • by Mr. Fred Smoothie ( 302446 ) on Friday November 02, 2001 @04:56PM (#2513995)
          Actually, the real debate on LKML was not whether something drastic needed to be done about the poor performance of the early 2.4 VMs, but *when* that should occur.

          Basically, the people who sided with Linus/Andrea were of the opinion that "things are so bad now [which was between 2.4.5 and 2.4.9] that a complete replacement of the VM even in a 'stable' kernel series is justified", and those who sided with Alan Cox/Ben LaHaise/Rik van Riel thought that the existing VM code could be massaged and tweaked enough so that the performance would become acceptable and huge changes could be postponed until 2.5 opened.

          This was complicated by the fact that between 2.4.5 and 2.4.9, the -ac series had accepted patches from Rik which weren't applied in the Linus branch and did in fact seem to be fairly successful in increasing performance through much less intrusive code changes. This was one of the main complaints of the Alan/Ben/Rik contingent; that the problems had already been largely resolved in the -ac tree, and that that approach should have been applied in Linus' tree before jumping to a complete rewrite.

          At this point, a consensus seems to be forming that the Andrea VM is *much* simpler, the changes haven't had much adverse effect on other subsystems, and the performance is just as good or better than the VM in the -ac series.

          The question of whether or not it should have waited until 2.5 is one that will probably never be answered to everyone's satisfaction, but at least will soon be academic.

      • "But I think it's a fair guess that NT and 9x had completely different VM subsystems..."

        I believe these operating systems have completely different *kernels*. The chance of them having the same VM subsystems seems slim.

        -Paul Komarek
    • by Anonymous Coward on Friday November 02, 2001 @03:24PM (#2513394)
      Making a VM subsystem is easy enough. Making a very high performance one that works well in as many cases as possible is not so easy - most OSes have a myriad of tweakable parameters (including Linux /proc/ files and mysterious NT registry keys, for example) to handle all the different special cases - but it's still a bit of a black art, since bizarre things like what sectors on the HD hold the swapped-out memory can make a big difference (personally I have a separate swap harddrive, but that's because I'm running nasty finite element analysis problems).
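
      For the curious, the Linux tunables mentioned above can be listed from user space. A minimal sketch, assuming the VM knobs live under /proc/sys/vm (the set of files present varies by kernel version); `read_tunables` is a hypothetical helper name.

```python
from pathlib import Path

def read_tunables(directory="/proc/sys/vm"):
    """Return {name: value} for each readable regular file in `directory`."""
    tunables = {}
    root = Path(directory)
    if not root.is_dir():
        return tunables              # not Linux, or /proc is not mounted
    for entry in sorted(root.iterdir()):
        if not entry.is_file():
            continue
        try:
            tunables[entry.name] = entry.read_text().strip()
        except OSError:
            continue                 # some entries are write-only or restricted
    return tunables

for name, value in read_tunables().items():
    print(f"{name} = {value}")
```

      On a machine without /proc the function simply returns an empty dictionary.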

      Also, the VM underlies a host of other bits of the OS, and as they change, so the VM has to change to accommodate them - for example, Linux's zero-copy unix domain sockets, or Linux's VFS layer.

      In short, no, VM design is not 100% solved.

      • by Ami Ganguli ( 921 ) on Friday November 02, 2001 @03:28PM (#2513409) Homepage

        Another factor is probably the tremendous range of hardware and workloads that Linux tries to handle. I don't think any other OS attempts to work well on watches and mainframes (and everything in between) while using the same code base.

        • Using the same code base is part of the problem under linux. Linus's aim is to keep linux performance up on old hardware like 386/486 systems (it's his baby and that's his choice). However, for linux to make a killing in the server field, it needs to be able to handle >2GB RAM and >4CPUs well. From what I've read, keeping linux running on old (and, let's face it, obsolete) hardware prevents it running as well as it could on these high powered systems.

          At this point we enter into arguments about "what is most important", but keeping the same code base for disparate functions is not an ideal situation.

      • by rho ( 6063 )
        Also, the VM underlies a host of other bits of the OS, and as they change, so the VM has to change to accommodate them - for example, Linux's zero-copy unix domain sockets, or Linux's VFS layer. In short, no, VM design is not 100% solved.

        That's interesting. I'm operating on my simplistic, naive notion that a VM is "the hard drive, where you dump pages when you're short on RAM or they get really stale". Thus, in my simple little world, the VM subsystem is affected the most by tweaks to the scheduler that swaps out pages. Is that where the major differences between the two VM schemes lie?

        If so, wouldn't it be a worthwhile effort to modularize that part out? (suddenly I see kernel hackers turning white as a sheet, gripping their chair arms in a fit of white-knuckled fear and loathing)

        I'm just a twink asking dumb questions...

        • That's interesting. I'm operating on my simplistic, naive notion that a VM is "the hard drive, where you dump pages when you're short on RAM or they get really stale". Thus, in my simple little world, the VM subsystem is affected the most by tweaks to the scheduler that swaps out pages. Is that where the major differences between the two VM schemes lie?
          Actually, the VM is "the subsystem which keeps you from getting short on RAM, by dumping pages to the hard drive when they get stale, while not swapping unnecessarily because of the big impact that disk I/O has on system performance."
          • Re:Not that simple (Score:2, Informative)

            by slamb ( 119285 )

            Actually, the VM is "the subsystem which keeps you from getting short on RAM, by dumping pages to the hard drive when they get stale, while not swapping unnecessarily because of the big impact that disk I/O has on system performance."

            It's not that simple, either ;)

            It does everything you said but also tries to minimize disk I/O by caching parts of the disk in memory. It has to maintain a balance between maximizing the cache and minimizing swap usage. I believe recently they've also talked about doing quite a bit more lookahead on the cache...if you're accessing one disk block/page/whatever, grabbing subsequent ones as well. (I'm not sure if this is the next block of the physical disk or the file, but that's not the point.) That would be an additional complication.
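
            The caching and lookahead idea can be sketched with a toy block cache. The assumptions here are mine: least-recently-used eviction and a fixed readahead window of the next N block numbers. The class and function names are hypothetical and this is not how either 2.4 VM is implemented.

```python
from collections import OrderedDict

class ReadaheadCache:
    """Toy block cache: LRU eviction plus fixed-size sequential readahead."""

    def __init__(self, capacity, readahead):
        self.capacity = capacity
        self.readahead = readahead
        self.blocks = OrderedDict()  # block number -> data, oldest first
        self.disk_reads = 0          # number of separate disk requests issued

    def _insert(self, block):
        self.blocks[block] = f"data{block}"
        self.blocks.move_to_end(block)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

    def read(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # cache hit: refresh recency
            return self.blocks[block]
        # Cache miss: one disk request fetches the block and its successors.
        self.disk_reads += 1
        for b in range(block + 1, block + 1 + self.readahead):
            self._insert(b)
        self._insert(block)  # requested block goes in last, so it survives
        return self.blocks[block]

def sequential_reads(readahead, blocks=8):
    """Read blocks 0..blocks-1 in order and count the disk requests needed."""
    cache = ReadaheadCache(capacity=8, readahead=readahead)
    for b in range(blocks):
        cache.read(b)
    return cache.disk_reads

print(sequential_reads(readahead=0), sequential_reads(readahead=3))
```

            Sequential access benefits from the lookahead, but every block fetched speculatively also takes a cache slot that something else could have used, which is the balancing act described above.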

        • by Paul Jakma ( 2677 ) on Friday November 02, 2001 @06:45PM (#2514514) Homepage Journal
          I'll have a stab at this one... not all the details might be correct, but it should be close enough to get the idea..

          VM is virtual memory. really in this context it should be: VMM, ie Virtual Memory Management.

          VM refers to the fact that on modern processors memory addresses used by processes do not refer to the physical location. Rather the address is a virtual address, and the processor translates it by some means to the physical address.

          Eg, if a process accesses memory at 0xfe12a201, the physical memory accessed might be 0x0000c201. The former address is a 'virtual' address, the latter is physical.


          Processors work with memory in discrete chunks called pages. A page might be 4KB of memory (eg on intel), or some other value. Each page has a number, a frame number (PFN), that identifies it. The part of the processor that deals with handling virtual memory is the Memory Management Unit (MMU). The MMU and operating system together maintain a set of tables that describe which pages correspond to which virtual memory addresses. These tables are known as "Page Tables" or "PT", each entry in a page table is a "Page Table Entry" or "PTE". A page table is usually held within one or more physical pages. Each process has its own set of page tables. The MMU interprets a part of the virtual address as an index:offset into the page tables. By looking up the PTE at offset x in the PT indexed by y, the MMU can determine which physical memory address corresponds to a virtual address (and more besides).


          process accesses memory at 0xfe12a201.

          MMU interprets 0xfe12a as the index, and retrieves the 0xfe entry (PTE) in the page table, which tells it which PFN the virtual address refers to. it then uses 0x201 as the offset into that page and fetches/operates on the memory located there.


          - virtual address -> split into index and offset.
          - index gives you the PTE.
          - the PTE holds the frame number of the physical page (and some other stuff)
          - the offset is the location within the frame
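
          The four steps above can be sketched as follows, assuming 4 KB pages and a single flat table keyed by index. The frame number in the mapping is hypothetical, and a dictionary stands in for the real in-memory page table.

```python
PAGE_SHIFT = 12                      # 4 KB pages
PAGE_SIZE = 1 << PAGE_SHIFT
OFFSET_MASK = PAGE_SIZE - 1

class PageFault(Exception):
    """Raised when no valid translation exists for a virtual address."""

def translate(page_table, vaddr):
    """Translate a virtual address using a flat index -> frame-number table."""
    index = vaddr >> PAGE_SHIFT      # upper bits select the PTE
    offset = vaddr & OFFSET_MASK     # lower 12 bits pass through unchanged
    if index not in page_table:
        raise PageFault(hex(vaddr))
    frame = page_table[index]
    return (frame << PAGE_SHIFT) | offset

# Hypothetical mapping: virtual page 0xfe12a lives in physical frame 0x0000c.
page_table = {0xFE12A: 0x0000C}
print(hex(translate(page_table, 0xFE12A201)))
```

          Note that only the page number is translated; the offset within the page is the same in the virtual and the physical address.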

          So everytime, (well nearly everytime), a process accesses memory, the MMU translates the virtual address in the above way. To speed things the MMU maintains a cache of translations in a unit known as the Translation Lookaside Buffer (TLB), which holds recent translations. If the MMU finds a translation there, it doesn't need to do the full lookup process.
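
          A rough model of the TLB idea: a small cache sitting in front of the full lookup. The least-recently-used eviction and the two-entry size are assumptions made for the sketch; real hardware varies, and the class name is hypothetical.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4 KB pages

class TLB:
    """Tiny translation cache with least-recently-used eviction."""

    def __init__(self, page_table, entries=2):
        self.page_table = page_table   # virtual page number -> frame number
        self.entries = entries
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        frame = self.cache.get(vpn)
        if frame is not None:
            self.hits += 1
            self.cache.move_to_end(vpn)        # mark as recently used
        else:
            self.misses += 1
            frame = self.page_table[vpn]       # the slow, full lookup
            self.cache[vpn] = frame
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict the oldest entry
        return (frame << PAGE_SHIFT) | offset

    def flush(self):
        """Drop every cached translation, e.g. when switching processes."""
        self.cache.clear()

page_table = {0x10: 0x2, 0x11: 0x7}
tlb = TLB(page_table)
for vaddr in (0x10000, 0x10004, 0x11008, 0x10010):
    tlb.translate(vaddr)
print(tlb.hits, tlb.misses)
```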

          so where does the operating system, or rather its VMM, come in? well, a MMU might find that when it goes to look up an index, that no valid PT or PTE exists. This might indicate the process is trying to access memory that it hasn't been allocated, the MMU would then raise a fault and switch control to the operating system's VMM code, which would probably decide to end the process with a memory access violation, eg SEGV under Unix, and perhaps dump the process's memory to a file to aid debugging. (a core dump.)

          also, the PTE holds more than just the frame number. There are various extra bits which the MMU and operating system can use to indicate the status of a page.

          Eg, one bit may indicate whether the page is valid or not. an OS'es VMM could use this to make sure that the MMU faults control to the VMM next time the page is accessed, perhaps to allow the VMM to read the page from disk into memory (ie swap).

          Other bits may indicate permission. Eg whether a page may be read or written to. This can facilitate shared libraries by allowing an OS to map the same physical pages into the page tables of several different processes. Also facilitates copy on write, for optimising fork().

          The cpu's MMU may maintain an 'accessed' bit and a 'written' bit to indicate whether a page has been accessed/written to since the last time the bit was cleared, so that the operating system can do bookkeeping and make informed decisions about memory activity.
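
          The status bits described in the last few paragraphs might be modelled like this. The bit positions and names are hypothetical; real PTE layouts are architecture specific.

```python
from enum import IntFlag

class PTE(IntFlag):
    """Hypothetical status bits; real layouts differ per architecture."""
    VALID = 1 << 0
    WRITABLE = 1 << 1
    ACCESSED = 1 << 2
    DIRTY = 1 << 3      # the "written" bit

class Fault(Exception):
    """Stands in for the MMU handing control to the operating system."""

def access(flags, write=False):
    """Return the updated flags after an access, or raise Fault."""
    if not flags & PTE.VALID:
        raise Fault("page not present")         # VMM might swap the page in
    if write and not flags & PTE.WRITABLE:
        raise Fault("write to read-only page")  # VMM might do copy-on-write
    flags |= PTE.ACCESSED
    if write:
        flags |= PTE.DIRTY
    return flags

print(access(PTE.VALID | PTE.WRITABLE, write=True))
```

          The operating system can later clear ACCESSED and DIRTY and see which pages get them set again, which is the bookkeeping mentioned above.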

          etc.. etc..

          The VMM's job beyond interacting closely with the MMU is to juggle which pages are kept in memory and which are swapped out to disk. perhaps if the OS does paged buffering, the VMM may also need to decide which buffer pages need to be written to disk (or read in pages to buffers). there are many ways it could do this, eg by maintaining lists of how and how often pages are used and make decisions about what to write out/read in based on that.
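
          One textbook way of deciding what to throw out is the "clock" or second-chance scheme, which uses a referenced bit per frame. The sketch below shows that general technique only; it is not a description of what Rik's or Andrea's VM actually does, and the class name is hypothetical.

```python
class ClockReplacer:
    """Second-chance ("clock") page replacement over a fixed set of frames."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames      # page held by each frame
        self.referenced = [False] * num_frames
        self.hand = 0
        self.faults = 0

    def access(self, page):
        if page in self.frames:
            # Hit: just note that the page was used recently.
            self.referenced[self.frames.index(page)] = True
            return
        self.faults += 1
        # Sweep: clear the referenced bit of recently used pages and stop
        # at the first frame whose bit is already clear.
        while self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.frames)
        self.frames[self.hand] = page          # evict and replace
        self.referenced[self.hand] = True
        self.hand = (self.hand + 1) % len(self.frames)

vm = ClockReplacer(num_frames=3)
for page in (1, 2, 3, 4, 2, 5):
    vm.access(page)
print(vm.frames, vm.faults)
```

          In this run page 2 is touched again just before page 5 arrives, so the sweep skips it and evicts page 3 instead.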

          it is in these intricate details that the various 2.4 VMMs differ.

          NB: details above are very architecture specific. different processors will have different page table layouts, different PTE status bits, etc.. eg on intel the virtual address is actually

          directory index : page table index: offset
          10 bits : 10 bits : 12 bits

          the directory index is an index to a directory of page table numbers, which saves on the amount of memory you need to hold a page table. the upshot being that the fine details of how paging works are processor specific.
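
          A sketch of the two-level walk, assuming the 10/10/12 bit split of classic (non-PAE) 32-bit x86 paging. Nested dictionaries stand in for the page directory and page tables, and the frame number is hypothetical.

```python
def split_two_level(vaddr):
    """Split a 32-bit virtual address into (directory, table, offset)."""
    offset = vaddr & 0xFFF             # low 12 bits: byte within the 4 KB page
    table = (vaddr >> 12) & 0x3FF      # next 10 bits: entry in the page table
    directory = (vaddr >> 22) & 0x3FF  # top 10 bits: entry in the directory
    return directory, table, offset

def translate(page_directory, vaddr):
    """Walk directory -> page table -> frame; KeyError stands in for a fault."""
    directory, table, offset = split_two_level(vaddr)
    page_table = page_directory[directory]
    frame = page_table[table]
    return (frame << 12) | offset

# Only the page tables that are actually in use need to exist, which is
# where the memory saving comes from.
page_directory = {0x3F8: {0x12A: 0x0000C}}
print([hex(part) for part in split_two_level(0xFE12A201)])
print(hex(translate(page_directory, 0xFE12A201)))
```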

          More NB's:

          page tables are process specific, switching between processes usually requires loading in the (set of) page tables of the new process. it also requires clearing out/invalidating all the existing TLB entries. this all takes time.

          Intel have an extension to their paged addressing, PAE, which allows for 36 bit physical addresses. I'm not sure, but i think it does this by splitting the directory index into 2 indexes, and increasing the PTE size to 36bits. (uses 64bits of memory though.)

          finally... there is plenty of reference material on the web, so research for yourself cause i'm probably wrong in a lot of places. ah well.. :)
    • I'd like to second this post.

      I have a vague sense that BSD and Solaris stand up better than linux under heavy loads, but I'm not sure why, if it's the VM, the scheduler, or the way the various systems interact.

      But the thing that I don't understand is the controversy surrounding the VM. I can understand controversy about whether to change horses in the middle of 2.4 -- it seems like there would be legitimate arguments on both sides.

      But isn't an optimal VM design something that's been clarified by research, or at least by the experiences of those who have built other kernels?

      I guess what I don't understand is why there would be a gap between Solaris and Linux, or BSD and Linux. Can't Linux just do it the BSD way? Is it an inertia thing, a matter of not wanting to break things?

      To ask the question another way, my big pet peeve with Windows 2K is that copying a big file (100's of megs) locks up the system. Solaris doesn't do that, Linux doesn't do that. Why does NT do it? Can't they look at the code in the Linux kernel and do something similar? I can't believe that the NT developers don't think it's a problem, I can't believe that they're not very smart guys. So what's the problem?

      BSD's missing some things that are hard to do without -- their java support isn't as good, and they don't have the device drivers that Linux has. It seems like it would be easier for linux to pick up what it lacks that BSD has, than the other way around.

      • Win2k locks up when you copy large files? I haven't seen this problem. I have a P4 system with 256 MB RAM, I have Win2k and Mandrake 8.0 dual boot. I have the ISOs for Mandrake on the Win2k partition and I copied them to another folder with no problem. These are 650 MB files. I believe that when you copy a file 2k reads a chunk and then writes it, then repeats. It wouldn't make sense to try to copy the entire contents into memory then write it. Besides there may not be enough contiguous sectors to store it, so it would have to break it into chunks anyway. Anyway, I think if this were a widespread problem we would have seen a patch a while back. just my opinion.
      • I remember reading about a problem with sblive and via chipsets having timing problems, at one point this caused file corruption as well!

        Until SP2 (I believe) Win2k did not natively support ATA100 drives, so many boards either provided their own, or in the case of ATA-RAID controllers the drivers were shown to the system as SCSI. In my experience the native ATA-100 or 3rd party ATA-100 drivers do not perform as well as their SCSI counterparts. Long story short, if you have an ATA-RAID disk, use the SCSI driver provided by the manufacturer... Also if it's a HighPoint(tm) controller don't use the latest version of their drivers, there is an issue with it where the mouse jumps and the sound card skips.

        If you do have a via board, you may want to check out viahardware.com, they have some excellent FAQ's as well as all the latest drivers.

        At any rate I am 99% sure it has something to do with your hardware, bios or drivers not Win2k specifically.

      • To ask the question another way, my big pet peeve with Windows 2K is that copying a big file (100's of megs) locks up the system.

        Interesting. I've copied files around that are several gigabytes in size and suffered no ill effects.

        What were you using to copy the files? Command line, Explorer? Define 'locks up the system'?
    • There's really no way to tell if the Windows VM is bad and is just being left as is, or is being changed, or is good, because you obviously can't run some other VM under the same load (even if you switch between Windows versions, it's a different load because you've got different system implementations and such).

      The design of a VM system also depends a lot on the rest of the system: the best VM is one that always has a page swapped in when you need it, and always has it in an acceptable part of memory. But which page is going to be needed next depends on what sorts of programs you're running, and tons of other factors. The VM system is trying to guess these, and there are some known heuristics for guessing, but there's no right solution for VMs in general.

      Apple has had a very different scheduling algorithm, which makes the problem totally different, and much easier: the applications not in front can be swapped out.

      I believe that, for a commercial UNIX, if you need swap, then you didn't put in enough RAM. If you could buy the system in the first place, you can afford more RAM. If the OS doesn't support enough RAM, get a version that does.
      • I believe that, for a commercial UNIX, if you need swap, then you didn't put in enough RAM. If you could buy the system in the first place, you can afford more RAM. If the OS doesn't support enough RAM, get a version that does.

        This I agree with. It was different 5-10 years ago, when 128 megs of RAM would buy you a pretty nice Honda. Now that you get RAM with your Happy Meals, it's quite different.

        Following your logic, a VM will mostly be used in a workstation or desktop situation -- server applications aren't a real focal point. The scheduling is fundamentally different between a workstation (say, a hacker's main axe from which he runs Emacs and gdb, or a 3D animator's bench that runs nothing but Maya) and a desktop that is likely to have 4 or 5 apps open at once with constant switching between them.

        Isn't this another argument for modularity of the VM, or at least the scheduler? Or am I missing something more fundamental?

        • You're missing something. 4gb of Ram is all many OSes will support. a 386 can address several terabytes, but you need to use funky segmentation registers, and anyone who remembers Dos wants nothing to do with that. I don't know about the latest linux versions, but FreeBSD doesn't do it, and I'm sure early linux versions don't. i'd be surprised if current linux versions handle that much. (Note, 32 bit machines only, I'm sure 64 bit machines like alpha support much more)

          Of course if you need more than 4 gb of Ram you also need programs that can handle that, and I know of no such thing.

          • Of course if you need more than 4 gb of Ram you also need programs that can handle that, and I know of no such thing.

            Not necessarily. There is going to be more than one process in existence, each of which needs memory. If you have say 8 Gb of physical ram, you could have 2 programs, each one of which accesses 4 Gb.

            I remember using a system which didn't do SMP, each processor had its own private memory. Basically once a process was created it either ran on that processor forever, or it would be swapped to a new processor. In this case obviously no single program could be allocated all the physical memory, but it was still useful.

        • It's hard to make a VM all that modular, of necessity it's closely integrated with the rest of the system.
    • I think it's the other way around. Both Linux and BSD had a VM long before the "mainstream" OSes had proper VMs. When Linux first came out, Windows 3.1 was mainstream. BSD was around before Windows 3.1.

      So really it's the other way around - the mainstream OSes are playing catch-up :-) (And I've had cause to need to find out about the gory details of NT4's VMM this week too).
    • by Azog ( 20907 ) on Friday November 02, 2001 @06:13PM (#2514375) Homepage
      No, it isn't a "solved" problem. And the Linux VM subsystem is a surprisingly good one.

      Remember that benchmarking Linux against other OS'es back in the 2.2 kernel days showed that Linux was at least in the same ballpark as the best BSD and Microsoft OS'es, and the 2.4 kernels are even faster.

      Of course there are lots of well known algorithms and approaches - take an advanced computer science operating systems course to find out - but it's a really difficult problem and it changes all the time, because hardware and user level software changes all the time. It's a combination of an art and a science. Many, many things have to be balanced against each other, hopefully using self-tuning systems.

      An excellent VM for running one workload (say, a database) might suck horribly when running a different workload (like a huge multiprocess scientific computation).

      Here are some of the things that make VM complicated. Consider how other operating systems deal with these:

      - Virtual Memory. Many applications allocate far more memory than they ever use. People expect this to work. So almost all VM's allow programs to allocate much more memory than is actually available, even when including swap. That makes the next point more tricky:

      - Out Of Memory. What should happen when a system runs out of memory? How do you detect when you are out of memory? If you are going to start killing processes when the system runs out of memory, what process should be chosen to die?

      - Multiprocessors. Lists of memory pages need to be accessed safely by multiple processors at the same time. And this needs to happen quickly, even on systems with 64 or more processors.

      - Portability. The Linux VM runs on everything from 486'es with 8 MB of RAM and 100 MB of swap to 16-processor, 64 GB RAM RISC systems to IBM 390 mainframes. These systems tend to have different hardware support - the details of the hardware TLB's, MMUs, CPU cache layout, CPU cache coherency... it's amazing how portable Linux is.

      - Interaction of the VM with file systems. File systems use a lot of virtual memory, for buffering and caching. These parts of the system need to communicate with each other and work together well to maximize performance. Linux supports a lot of filesystems and this gets complicated. For example, you may want to discard buffered file data while keeping metadata in memory when available memory is low.

      - Swap. When should a system start swapping out? How hard should it try to swap out? What algorithms should be used to determine what pages should be swapped out? When swapping in, how much read-ahead should you do? Read ahead on swap-in might speed things up, but not if you are short on memory and end up discarding other pages...

      - Accounting for memory usage is complicated by (among other things) memory-mapped files, memory shared between multiple processes, memory being used as buffers, and memory "locked" in to be non-swappable.

      - Keeping track of the state of each page of memory - is it dirty (modified)? Anonymous? Buffer? Discardable? Zeroed out for reuse? Shared? Locked? Some combination of the above?

      - Even worse: memory zones. On some multiprocessor systems, each processor may be able to access all the memory, but some (local RAM) may be reachable faster than others. The VM system should keep track of this and try to use faster memory when possible - but how do you balance that when the fast local RAM is getting full?

      - Interactions with networking and other drivers. Sometimes drivers need to allocate memory to do what they do. This can get very tricky when the system is low on memory. What if you need to allocate memory in a filesystem driver in order to write out data to the disk to make space because you are running out of memory? Meanwhile network packets are arriving and you need to allocate memory to store them. Sometimes hardware devices need to have multiple contiguous pages allocated for doing DMA, but if space is tight it can be very hard to find contiguous blocks of free memory.
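
      To make the page-state point in the list above a bit more concrete, here is a toy sketch in Python. The flag names and the `reclaim_action` helper are made up for illustration; they are not the kernel's real page flags or reclaim logic, and real VMs weigh far more than this.

```python
from enum import Flag, auto

class PageState(Flag):
    """Made-up state bits for one page; not the kernel's real flag names."""
    DIRTY = auto()      # modified since it was last written to disk
    ANONYMOUS = auto()  # not backed by a file (heap, stack)
    BUFFER = auto()     # holds buffered file data
    SHARED = auto()     # mapped by more than one process
    LOCKED = auto()     # pinned in RAM, must not be swapped

def reclaim_action(state):
    """Decide what a toy pager would have to do before reusing a page."""
    if PageState.LOCKED in state:
        return "keep"
    if PageState.ANONYMOUS in state:
        return "swap out"    # simplification: assume its only copy lives in RAM
    if PageState.DIRTY in state:
        return "write back"  # file-backed but modified: flush to the file first
    return "discard"         # clean file-backed page: can be re-read later

print(reclaim_action(PageState.BUFFER | PageState.DIRTY))
```

      Even this tiny version shows why the combinations matter: the same page can be shared, dirty and locked at once, and the order in which those bits are checked is itself a policy decision.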

      I'm not an expert on VM's either, but I've taken courses on operating system design and I read the kernel mailing list --- it is a hard, hard problem to make a fast, reliable, portable, feature-rich system.

      • This was helpful, as was the long post below that had a fantastic overview of the process behind a VM. Even if it wasn't correct, it helped me form a mental picture of what was happening.

        It sounds, though, that the major source of complication is the multi-purpose nature of the kernel: scalability problems, for want of a better term. A database server is different from a desktop from a mainframe from a PocketPC.

        Is it worth the effort to modularize the VM subsystem? If not completely modularized, enough parameters moved to configurable settings in /proc files where these decisions can be made more sanely for different environments? (this may be the case now -- I have no idea, since my customization needs have never met that level of granularity)

        Thus, RedHat Server has a different VM than RedHat Desktop, and Debian users can apt-get a configuration for their database server.

        If a VM is hard to engineer because of the different ways it is used, then engineer it to be flexible for different uses. Not many bridges are built to support pedestrian, automobile and train traffic all at the same time (and still meet community standards for visual appeal).
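
        On the /proc question: a quick way to see what is already exposed on a given box is to dump the files under /proc/sys/vm. A minimal sketch in Python, assuming a Linux system with /proc mounted; `read_tunables` is just an illustrative helper name, and which files show up depends on the kernel version.

```python
import os

def read_tunables(directory="/proc/sys/vm"):
    """Return {name: contents} for each readable regular file in `directory`."""
    settings = {}
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if not entry.is_file():
            continue
        try:
            with open(entry.path) as handle:
                settings[entry.name] = handle.read().strip()
        except OSError:
            # Some entries are write-only or need root; skip them.
            continue
    return settings
```

        Calling `read_tunables()` and printing the result gives a rough picture of how much of the VM's behavior is adjustable at run time without touching the code.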

    • In a traditional kernel -- which is all Linus aspires to for the Linux kernel -- everything is basically a solved problem. There's lots of tweaking, and testing, and technical debates -- but there's really not anything all that interesting.

      It is somewhat interesting, because the kernel is the foundation for everything else. Well, libc is nearly as basic. So what happens at that low level affects everything, whether you run KDE or Gnome or just the command line.

  • to know how XP's kernel would do, and the different *BSDs, QNX, or whatever else you have up your sleeve.

    Heck, know what would be the best? A pluggable kernel system, where anyone could switch VM. Hmm, hurd? Anyways, it's nice to see 2.4 making progress, but we all kinda guessed that.

    After all, if we know what is "best", then people could try to break that and become even better. And who could lose from that? ;)
    • Heck, know what would be the best? A pluggable kernel system, where anyone could switch VM.

      That's been suggested for Linux before, and the general feeling was that that would be so complicated (the memory manager changes touched most files in the kernel) and hard to test, that it would basically be a nightmare.

      • That's not flamebait, it's practically from the lips of Alan Cox himself. But don't let that interrupt your rush to mis-moderate...

        • Heh, if I ever said anything like this at work I'd get an "Are you afraid to code?" out of my boss...

        Systems are getting more complex and demands on them get more complex. People have to plan harder and think harder to keep up. It's time to step up to the plate or go home. Years ago Linux didn't need a VM that worked with 4 way SMP and 2 Gig of ram. Today it does.

        Claiming that modularity is (I'm paraphrasing) "too hard" comes off more as a cop-out than a reason. If it's hard, do a better job at it. I don't think anyone is claiming that the VM should only take a weekend to do... They just want it done and done right. The argument FOR modularity put forth above is a solid one. Whatever planning/architecting/coding/testing and debugging that takes. Just do it[tm].
        • Claiming that modularity is (I'm paraphrasing) "too hard" comes off more as a cop-out than a reason

          Yes, let's make a fundamental, pervasive, part of the kernel hot pluggable, introduce tons of potential bugs and incompatibilities and create lots of work, all for questionable benefit. Engineering involves a series of tradeoffs and, in this case, most people see the pain as being too great for the potential payoff.

          But, if you still want to, go right ahead. That's one of the cool things about Linux: if nobody else wants to do it, you still can.

        • Especially if you could plug the same VM into a BSD kernel or whatever you might want/make. I'd like to see a more modular way of building OSes, so that we can get a more dynamic way of doing things and not lock everything in the Linux kernel and go "if you want it changed, change it", as if everyone only codes kernels...

          The pitfall would probably be how to define the APIs and how to handle their aging...
        • Like everything else in computing, modularity comes at a price.
  • by Anonymous Coward on Friday November 02, 2001 @03:20PM (#2513376)
    I don't think most people really thought the 2.4 VM was a worse performer than 2.2, especially under normal load, and in recent kernels even under high loads.

    However, one thing that was not evaluated in this writeup at all was stability, especially on big boxes (as in SMP and >1GB) and heavy workloads. This is where neither VM really seems to be able to hang in there.

    I admin seven such boxes that all have 2 or 4 CPUs, and 2 or 4GB of RAM, and during heavy jobs get hit pretty hard. These things run 100% rock solid with 2.2.19; I've achieved uptimes of greater than six months on all boxes simultaneously. Basically, reboots are for kernel upgrades, nothing more.

    With 2.4.x, I'm happy to get a few weeks, and sometimes much less. The machine practically always dies during heavy VM load. It has kept me from upgrading to 2.4 for several months now.

    The real kicker, when 2.4 is running correctly, my jobs run as much as 35-50% faster than with 2.2, especially on the 4 CPU server, so I really wish the VM was stable enough to allow me to move.

    Anyway, I'm sure it will get there sometime.

    BTW, before people write about how their 2.4 boxes have super long uptimes let me say that I too have some 2.4 based systems that have been up since 2.4.2 was released, but these machines are either single CPU, or SMP but with 512MB of RAM. 2.4 seems to run quite well in this case.
    • I have two dual boxes with 1 GB each in them and they seem to take whatever I throw at them. I have yet to have any panics or what-not. (So you know what kind of loads I'm talking about: one box runs 3 CS servers with 3 HLTV proxies, and the other runs mysql, apache (slash), hlstats for the servers, and is my desktop box. Both run DNS and RC5 and have 2.4.10 or greater. Nothing too bad, but not idle either.) I have no complaints about the kernels yet.

      Now getting ext3 in the tree is something I would like :)

    • by Anonymous Coward
      hmm..i have hard lockups like the guy described with 2.2.19 on a single CPU 733MHz P-III system with 256MB of RAM and 2 Gigs of swap. im running vmware with 5 OSes running simultaneously under high loads and it typically locks up after 3-4 weeks.
    • However, one thing that was not evaluated in this writeup at all was stability, especially on big boxes (as in SMP and >1GB) and heavy workloads. This is where neither VM really seems to be able to hang in there.

      This is a pretty significant issue to me, too, because I admin a dual processor box with 1GB running MySQL under the 2.4.2 kernel, and it bogs down under relatively light loads, swapping like mad, and eventually righting itself if given enough time. Once news of the VM problems came to my attention, I started looking into it, and I confess I'm stumped. There's no reason I can't roll back to 2.2.19 or forward to whatever the 2.4.x kernel du jour is. At this point, I'm inclined to go back to 2.2.19 for the time being because there were no problems there, but I'd certainly rather go forward if I can.

      Either way, I need to fix it soon, because the database in question drives a busy commercial website, and my boss is quite legitimately upset about the problem.

  • by Karmageddon ( 186836 ) on Friday November 02, 2001 @03:21PM (#2513385)
    when I learned computer science--which I admit was a long time ago, but that means the "gurus" have all had plenty of time to catch up--they taught us that if you obeyed the principles of modularity that you could have more than one implementation of something and use what was appropriate for the particulars of a given situation....

    ...so why does linux have 1 VM? it seems that 2 of them exist, and the BSD's have more... guys, "gimme a hunk" and "page fault" aren't exactly rocket science anymore, particularly with hardware support... the fact that there is room to make a big deal out of this is the problem, not the VMs.

    • so why does linux have 1 VM? it seems that 2 of them exist, and the BSD's have more... guys, "gimme a hunk" and "page fault" aren't exactly rocket science anymore, particularly with hardware support... the fact that there is room to make a big deal out of this is the problem, not the VMs.

      If Linux was a microkernel I'm pretty sure this would be possible but from what I've seen of the Linux kernel code and from some discussions on the linux kernel mailing list [zork.net], the virtual memory code is too entrenched in various parts of the code to be #ifdefed around with any sort of ease.
    • What good is it to have more than one VM? Honestly, just because something could be made into a feature doesn't mean that it should be.

      I guess I have heard about other operating systems where you could choose between vm's but that doesn't make much sense to me. Did one of the vm's fail under certain cases? If so then it should have been fixed instead of just patching over the problem.

      To me it makes more sense to just have one vm that works and is well understood.
    • Computer theory is nice and all, but I wonder how often reality impinges on the nice thought processes of theorists. The VM is an awfully low level piece of code. Almost everything else in the system touches the VM. It's just really hard to make something like that a module that can be swapped out. It has been done, but it's very complex. Even in microkernels the VM is usually implemented partly in the kernel, with only the pager (the thing that decides what to page out) being a modular component.
  • Here [byte.com] is an excellent article I found linked on rootprompt [rootprompt.org] yesterday that goes into considerable detail about the 2.4 VM (or VMs, as the case may be).

  • I am glad to see the 2.4 VMs doing so well. I assume that Linus is not at all satisfied with the VM code and that is the reason the 2.5 branch is not started. Hopefully it will start soon when the VM trouble is solved!
  • by duffbeer703 ( 177751 ) on Friday November 02, 2001 @03:39PM (#2513460)
    But the fact remains that this VM holy war should have been resolved in the 2.3 series of kernels.

    The number of major problems and architectural changes being made to the supposedly 'stable' branch of the Linux kernel has really run amok.

    I'm sure there are plenty of outrages to come as bad bugs are found in the volume manager and other new elements of 2.4.
    • I think we've been over this ground before.

      The reason Linus released 2.4 as prematurely as he did was to get it tested because all the potential testers were shying away from a development kernel. I wouldn't have felt comfortable running it myself.

      One possible solution is to have a testing kernel. Then we'd all be happy, well, at least I would.
  • by Miles ( 79172 ) on Friday November 02, 2001 @03:53PM (#2513538) Homepage
    Just a possibly interesting data point. I played Unreal Tournament with 2.4.12-ac5 and 2.4.10 (both from Debian). 2.4.10 always seems to work fine for extended periods of LAN play (as both a client and server), whereas 2.4.12-ac5 choked after a few games: the swap ended up being nearly all used up.
    Of course, this was hardly a scientific test, but I think I'll stick to something proven for now.
  • 2.4.13 VM (Score:5, Informative)

    by sfe_software ( 220870 ) on Friday November 02, 2001 @03:57PM (#2513567) Homepage
    I can't speak for the differences between the two VM layers in the most recent versions of each, but I went from 2.4.7-2 (RH Roswell Beta stock kernel) to 2.4.13 (+ext3 patch), and I've noticed a serious improvement.

    My notebook has 192 megs and 256 meg swap partition. I run Mozilla constantly (which seems to constantly grow in memory usage as the days pass). Prior to the upgrade (2.4.7-2, recompiled without the debugging options RH had on by default), swapping was ungodly slow. Switching between Mozilla and an xterm would literally take a few seconds waiting for the window to draw on the screen. Even switching between tabs in Moz was slow.

    Since going to 2.4.13 with ext3 patch, I've noticed a serious improvement in this behavior. Under the same conditions (between 20 and 50 megs swap usage), switching between windows is quite fast. I don't know if it's faster at swapping per se, or if it's just swapping different things (eg, more intelligently deciding what to swap out), but for me it "seems" much faster for day-to-day usage.

    I haven't yet tested in a server environment... but for desktop usage, 2.4.13 rocks. Can't wait for 2.4.14, to see if any noticeable improvements are added...

    Though it will be a non-issue once I add another 128 megs to this machine, it's nice to see such great VM performance under (relatively) low memory conditions.
    • Actually, with 2.4.14-pre6 (I'm running the XFS CVS code with the preempt patches) Linus has apparently declared the VM fixed and dared people to break it. I've been using the kernel quite heavily for a few days, and it's been great. I don't know how much of an improvement it is over 2.4.13, though. I had 2.4.13 for a total of a day before XFS CVS updated to the 2.4.14-pre series. (Linus releases pre kernels at an ungodly rate, and the XFS team manages to keep up.)
  • I thought it was interesting how the author discussed the mid-release swap out of VM code in the 2.4 series kernel. He mentioned that Linus had felt that the AA version of the code was better than the existing version and wholesale swapped it out.

    Does anyone know why Linus did this? Are there some empirical results somewhere that dictate a reason to do this? Certainly, it would seem that there would be, but the author didn't point it out.

    Also the author pointed out that 2.2.x kernels break at very high load levels. Is there documentation somewhere discussing what that might be?

    Take care,


  • tcsh time variable (Score:4, Informative)

    by brer_rabbit ( 195413 ) on Friday November 02, 2001 @05:37PM (#2514209) Journal
    I don't know about other shells, but tcsh has some features that provide other useless statistics. You can set a variable called "time" that can provide additional information. From the tcsh man page [edited]:

    time: If set to a number, then the time builtin (q.v.) executes automatically after each command which takes more than that many CPU seconds. If there is a second word, it is used as a format string for the output of the time builtin. (u) The following sequences may be used in the format string:

    %U The time the process spent in user mode in cpu seconds.
    %S The time the process spent in kernel mode in cpu seconds.
    %E The elapsed (wall clock) time in seconds.
    %P The CPU percentage computed as (%U + %S) / %E.
    %W Number of times the process was swapped.
    %X The average amount in (shared) text space used in Kbytes.
    %D The average amount in (unshared) data/stack space used in Kbytes.
    %K The total space used (%X + %D) in Kbytes.
    %M The maximum memory the process had in use at any time in Kbytes.
    %F The number of major page faults (page needed to be brought from disk).
    %R The number of minor page faults.

    Particularly, if you could measure the number of swaps/page faults in the different kernels it would be pretty useful. I've got $time set like this:
    # have time report verbose useless statistics
    set time= ( 30 "%Uuser %Skernel %Eelapsed %Wswap %Xtxt %Ddata %Ktotal %Mmax %Fmajpf %Rminpf %I+%Oio" )
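
    For anyone not on tcsh, roughly the same counters can be read from a script. A minimal sketch in Python, assuming a Unix system where the standard `resource` module is available; `run_and_report` is a made-up helper name, and note that current Linux kernels typically leave the swap counter at 0.

```python
import resource
import subprocess

def run_and_report(argv):
    """Run a command and return the change in child resource counters."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=False)  # waits for the child to exit
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "user_s": after.ru_utime - before.ru_utime,          # like %U
        "kernel_s": after.ru_stime - before.ru_stime,        # like %S
        "major_faults": after.ru_majflt - before.ru_majflt,  # like %F
        "minor_faults": after.ru_minflt - before.ru_minflt,  # like %R
        "swaps": after.ru_nswap - before.ru_nswap,           # like %W
    }
```

    Something like `run_and_report(["make", "bzImage"])` on each kernel would give the major/minor fault counts side by side for the same compile job.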
  • I do not follow *BSD nearly enough to make this kind of observation, but I thought I recalled that when Universal Virtual Memory [nec.com] was rolled into NetBSD, it was widely regarded as a good design. Anybody with much more VM design knowledge able to comment on how suitable a design like that one (or other well-regarded VM design from other Unixes) would be for the Linux kernel?
    • I've read the UVM paper twice (it's like 200pp) and it really is a good design. I don't like the BSD idea of abstracting the VM away from the hardware so much (you spend a lot of memory in places you don't have to) but it does make the design cleaner. However, UVM is more of the mechanism part of the VM. It's the code that lets you share memory, memory map files, etc. What the Linux VMs are competing on, on the other hand, are the policy parts of the VM: what gets paged out and when. As it stands, UVM is probably tied with Linux on the low-level VM part (though UVM has a cleaner interface to the MMU hardware, IMO). The virtual memory filesystem cleaned up handling of shared memory segments significantly. (Before 2.4, swapping out normal memory and swapping out SysV shared memory were two different things.) It has many more (cool) features, such as sharing address spaces, trading address space, etc, but few *NIX apps use those features (since it's not a standard *NIX API). FreeBSD probably has the best policy mechanism right now. The low-level VM is a heavily-tweaked one based on Mach's, but it doesn't have most of the modern features of UVM.
  • by tuxlove ( 316502 ) on Friday November 02, 2001 @07:09PM (#2514633)
    He notes in his commentary that the 2.2 kernel "felt faster" or something to that effect, while still performing much worse in actual numbers. This is probably a manifestation of a well-known effect in the world of performance: responsiveness and throughput are often mutually exclusive.

    In other words, given fixed parameters, it's usually not the case that you can improve both responsiveness and throughput at once. If you don't change memory, CPU speed or I/O bandwidth, and your code is devoid of excess baggage which effectively reduces one of the above, it is almost a given that the two are a tradeoff. I've personally experienced this numerous times in my own performance work, and have read research by others that corroborates it.

    Here are some really interesting fundamental examples. One company I worked at lived and died by disk performance benchmarks, in particular the Neal Nelson benchmark. This test ran multiple concurrent processes, each of which read/wrote its own file. The files were prebuilt to be as contiguous on disk as possible so that sequential I/O operations wouldn't cause disk seeks. By the nature of the test, though, seeking would happen a lot because you had N processes each reading/writing a different contiguous file. So, you lost the benefit of the contiguousness. Until, that is, we came up with a way of scheduling disk I/Os which, given a choice of many pending I/Os in a queue, favored starting I/Os which were close to where the disk head happened to be. This wasn't your father's elevator sort! The disk head would hover in one spot for extended periods, even going backwards if necessary to avoid long seeks. It was a bit more sophisticated than that, but those are the basics.

    The effect was, if a process started a series of sequential I/O operations, such as reading a file from beginning to end, no other process could get much of anything through until it was done. So what did this do to performance? Well throughput shot through the roof because disk seeks were nonexistent. The test performed beautifully, as it only measured throughput, and we consistently won the day. However, I/O latency for the processes that had to wait was extremely high, sometimes on the order of minutes.

    Needless to say, these "enhancements" were only useful for benchmarking, or perhaps for a filesystem on which the only things running were batch processes of some kind. It would feel slow as molasses to actual human users, verging on unusable if anyone started pounding the disk. You can't wait 60 seconds for your editor to crank up a one-page file (well, okay, we didn't use MS Office in those days :). On paper it was fast as hell; in practice it seemed very slow.
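
    The shape of that tradeoff can be reproduced with a toy model. This is only a sketch in Python under made-up assumptions: each process has exactly one request outstanding, a sequential read costs 1 time unit, any non-sequential read costs an extra fixed `seek_penalty`, and the "nearest" policy is a crude stand-in for the real scheduler, which was more sophisticated.

```python
def simulate(policy, num_procs=4, blocks_per_file=1000,
             file_gap=10000, seek_penalty=10):
    """Toy disk: compare 'fifo' and 'nearest' request scheduling."""
    head = 0    # track the head would read next without seeking
    clock = 0
    # One outstanding request per process: (issue_time, track).
    pending = {p: (0, p * file_gap) for p in range(num_procs)}
    done = {p: 0 for p in range(num_procs)}
    worst_wait = 0
    while pending:
        if policy == "fifo":
            # Serve whichever request has been waiting longest.
            proc = min(pending, key=lambda p: (pending[p][0], p))
        else:
            # Serve whichever request is closest to the head.
            proc = min(pending, key=lambda p: (abs(pending[p][1] - head), p))
        issued, track = pending.pop(proc)
        clock += 1 + (seek_penalty if track != head else 0)
        head = track + 1
        worst_wait = max(worst_wait, clock - issued)
        done[proc] += 1
        if done[proc] < blocks_per_file:
            # The process immediately asks for its next sequential block.
            pending[proc] = (clock, track + 1)
    return {"total_time": clock, "worst_wait": worst_wait}

for name in ("fifo", "nearest"):
    print(name, simulate(name))
```

    With these assumed costs the nearest-first policy finishes the whole batch far sooner, but the longest any single request waits is far worse, because the last process gets nothing until everyone ahead of it has read an entire file.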

    One paper I read on the subject of process scheduling postulated that by increasing the max time slice of a process you could improve performance. The idea was that you would context switch less, would improve the benefits of the CPU cache, and so on. They increased the time slice to something above 5 seconds and ran some tests. Of course, the throughput improved by some nontrivial amount. Predictably, though, the system became unusable by actual human users for the same reason as in my disk test example.
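
    The time-slice effect is easy to see in a back-of-the-envelope simulation. A minimal sketch in Python, assuming plain round-robin scheduling with a fixed per-switch overhead; the numbers (milliseconds) are invented for illustration and are not taken from that paper.

```python
from collections import deque

def round_robin(work_per_proc, time_slice, switch_cost):
    """Simulate round-robin scheduling; all times are in milliseconds."""
    queue = deque(enumerate(work_per_proc))  # (pid, remaining work)
    clock = 0
    first_run = {}  # when each process first got the CPU
    switches = 0
    while queue:
        pid, remaining = queue.popleft()
        first_run.setdefault(pid, clock)
        ran = min(time_slice, remaining)
        clock += ran
        remaining -= ran
        if remaining > 0:
            queue.append((pid, remaining))
        if queue:
            # Charge the overhead whenever another slice has to be dispatched.
            clock += switch_cost
            switches += 1
    return {"finish_time": clock, "switches": switches,
            "worst_first_run": max(first_run.values())}

jobs = [10000] * 4  # four processes, 10 seconds of CPU each
print("10 ms slice:", round_robin(jobs, time_slice=10, switch_cost=1))
print("5 s slice:  ", round_robin(jobs, time_slice=5000, switch_cost=1))
```

    The long slice wins on total finish time because it pays the switch overhead only a handful of times, but the last process in line waits seconds rather than milliseconds before it runs at all, which is exactly what a human at the keyboard notices.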

    The other extreme would be absolute responsiveness, in which you spend all your time making people happy but not getting any real work done. An example of this would be "thrashing", where the kernel spends most of its time context switching and not actually running any one process for an appreciable amount of time.

    The sweet spot for the real world is somewhere in between, perhaps a little closer to the throughput side of the spectrum. It sounds like this may be the direction they've gone with the 2.4 kernel, though I'm sure they've done a lot of optimizing and rearchitecting to improve performance overall.
    • Of course, it all depends on the application one is using the system for. If you are running a webserver, you can tend towards the throughput side because latencies are going to be limited by the latency of the net connection (at least 20-30ms usually). If you're doing real time audio, you want to tend towards the latency side. On a desktop machine, it is probably best to tend towards latency. I doubt anybody minds a compile finishing 10% slower if it means that their MP3s don't skip and their mouse doesn't jump around. Interestingly, the 2.4 kernel is actually pretty good about both (maybe excluding VM and some filesystem code). With the preempt patches, latency is down to 2ms or so, while throughput is almost unchanged (or even sped up, depending on the task). Of course, you make the assumption that there are no other things limiting the performance of the app. Having used Linux GUIs for awhile, I'd say that's quite an assumption...

Machines that have broken down will work perfectly when the repairman arrives.