
Anatomy of Linux Kernel Shared Memory

Posted by kdawson
from the culture-by-another-name dept.
An anonymous reader sends in an IBM DeveloperWorks backgrounder on Kernel Shared Memory in the 2.6.32 Linux kernel. KSM allows the hypervisor to increase the number of concurrent virtual machines by consolidating identical memory pages. The article covers the ideas behind KSM (such as storage de-duplication), its implementation, and how you manage it.

  • First Post (Score:1, Redundant)

    by stink_eye (1582461)
    Seems to me this feature has been available for a while in VMWare...
    • Re: (Score:2, Informative)

      by Anonymous Coward

      VMWare? Fuck, this has been around for decades in the case of OS/360.

      • The use of KSM isn't without risk:

        "A region can be removed from the mergeable state through the MADV_UNMERGEABLE parameter (which immediately un-merges any merged pages from that region). Note that removing a region of pages through madvise can result in an EAGAIN error, as the operation could run out of memory during the un-merge process, potentially leading to even greater trouble (out-of-memory conditions). "

        KSM *will* overcommit the memory as it is designed to do so. And please also be mindful that Mur
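
The madvise calls quoted above can be tried from user space. What follows is a minimal sketch, assuming Linux and Python 3.8+ (where `mmap.madvise` and the `MADV_MERGEABLE` / `MADV_UNMERGEABLE` constants are exposed). The function name `mark_mergeable` is made up for illustration, and the mapping is private and anonymous because, as far as I know, that is the kind of memory KSM considers.

```python
import mmap

PAGE = mmap.PAGESIZE


def mark_mergeable(num_pages=4):
    """Opt a private anonymous region in to KSM, then opt it back out.

    Returns True if the kernel accepted MADV_MERGEABLE, False otherwise
    (non-Linux system, old Python, or a kernel built without KSM).
    """
    mergeable = getattr(mmap, "MADV_MERGEABLE", None)
    unmergeable = getattr(mmap, "MADV_UNMERGEABLE", None)
    if mergeable is None or unmergeable is None:
        return False

    # fileno -1 gives an anonymous mapping; MAP_PRIVATE keeps it process-local.
    region = mmap.mmap(-1, num_pages * PAGE, flags=mmap.MAP_PRIVATE)
    try:
        if not hasattr(region, "madvise"):
            return False
        # Fill every page with the same bytes so the pages are identical.
        region.write(b"\x42" * (num_pages * PAGE))
        try:
            region.madvise(mergeable)
        except OSError:
            return False
        try:
            # Un-merging may need fresh memory, so this call can fail too.
            region.madvise(unmergeable)
        except OSError as exc:
            print("un-merge failed:", exc)
        return True
    finally:
        region.close()


if __name__ == "__main__":
    print("kernel accepted MADV_MERGEABLE:", mark_mergeable())
```

Registering a region only makes it a candidate; nothing is merged unless the KSM scanner is actually running.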

    • Re:First Post (Score:5, Informative)

      by Anonymous Coward on Saturday April 17, 2010 @05:27PM (#31883766)

      From the article:
      "Going further

      Linux is not alone in using page sharing to improve memory efficiency, but it is unique in its implementation as an operating system feature. VMware's ESX server hypervisor provides this feature under the name Transparent Page Sharing (TPS), while XEN calls it Memory CoW. But whatever the name or implementation, the feature provides better memory utilization, allowing the operating system (or hypervisor, in the case of KVM) to over-commit memory to support greater numbers of applications or VMs. You can find KSM—and many other interesting features—in the latest 2.6.32 Linux kernel."

    • Re: (Score:1, Interesting)

      by BitZtream (692029)

      I know for a fact the concept isn't new to Windows, VMware or FreeBSD (though none of them work exactly the same as this).

      I would have presumed this wasn't new to Linux either, just different from the existing implementation (I know it's blasphemy here, but I'm not a Linux person).

      It's certainly been done on mainframes for god knows how long.

      I doubt this is as groundbreaking as it's being promoted.

      If your OS isn't sharing duplicate memory blocks already, you're using a shitty OS. (Linux already shares dup

      • Re:First Post (Score:5, Insightful)

        by Abcd1234 (188840) on Saturday April 17, 2010 @05:34PM (#31883798) Homepage

        If your OS isn't sharing duplicate memory blocks already, you're using a shitty OS. (Linux already shares dup read only blocks for many things, like most modern OSes).

        Umm, no.

        Most modern OSes share memory for executable images and shared libraries. In addition, some OSes, such as Linux, support copy-on-write semantics for memory pages in child processes created with fork (note, Solaris is an example of an OS that *doesn't* do this).

        Aside from that, there is no automated sharing of memory between processes. Frankly, I have no idea where you got the idea there was.
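
The fork case mentioned above can be shown with a small sketch (Unix-only; `fork_and_modify` is a name invented here). It only demonstrates the visible semantics: the child's write never reaches the parent. Whether the kernel copies the page lazily on that write (COW) or eagerly at fork time cannot be told apart from this output alone; COW is what makes the sharing cheap.

```python
import os


def fork_and_modify():
    """Fork a child that overwrites a buffer; return (parent_view, child_view)."""
    data = bytearray(b"parent")
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: this write lands on a page inherited from the parent, so the
        # child ends up with its own private copy of that page.
        os.close(read_fd)
        data[:] = b"child!"
        os.write(write_fd, bytes(data))
        os.close(write_fd)
        os._exit(0)

    # Parent: read what the child saw, then reap the child.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data), child_view


if __name__ == "__main__":
    print(fork_and_modify())
```

Pages the child never writes to stay shared with the parent, which is the sharing being described; KSM differs in that it finds identical pages between processes that have no such common ancestry.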

        • by keeboo (724305)

          In addition, some OSes, such as Linux, support copy-on-write semantics for memory pages in child processes created with fork (note, Solaris is an example of an OS that *doesn't* do this).

          Wait, Solaris doesn't do copy-on-write? (uh-oh)
          I, too, expected that to be pretty much standard nowadays.

          • Re: (Score:3, Insightful)

            by Trepidity (597)

            This article [sun.com] claims that on Solaris, "fork() has been improved over the years to use the COW (copy-on-write) semantics". It's sort of an in-passing comment, though, and I can't find a definitive statement in docs anywhere (the Solaris fork() manpage [sun.com] doesn't give any details).

        • In addition, some OSes, such as Linux, support copy-on-write semantics for memory pages in child processes created with fork (note, Solaris is an example of an OS that *doesn't* do this).

          What year do you live in? Solaris _9_ had COW and multiple page size support, over half a decade ago. Linux large page size support is a joke, Solaris x86 even does it better on Linux's home turf.

          http://www.sun.com/blueprints/0304/817-5917.pdf [sun.com]

          Most modern OSes have a native fibre channel stack, except notably, Linux which doesn't have userland utilities for managing SCSI devices or even fibre channel drivers for that matter.

          See what I did?

          • by Abcd1234 (188840)

            What year do you live in? Solaris _9_ had COW and multiple page size support, over half a decade ago.

            *snicker* I like how you phrased that as "half a decade ago"... "five years ago" sounds far less impressive, when you consider how industrial strength Solaris has traditionally been considered. :) That said, I fully concede my information is probably out-of-date. Glad to see Solaris finally moved into the 21st century!

            Besides, it wasn't a criticism of Solaris (the only reason I came across the factoid at

        • by veliath (5435)

          Solaris is an example of an OS that *doesn't* do this

          Are you sure? When the child modifies its globals or anything in its heap, the memory has to be COWed.

      • You are right, the "news" here, is that someone made an article explaining KSM, like, slashvertising.
      • Re: (Score:3, Interesting)

        by bzipitidoo (647217)

        Personally, I find memory compression [lwn.net] more interesting than just deduplication, which could be considered a subset of compression. The idea has been around for years. There used to be a commercial product for Windows 95 [wikipedia.org] that claimed to compress the contents of RAM, but which had many serious problems, such as that the software didn't work, didn't even attempt compression, and was basically "fraudware". The main problem with the idea was it proved extremely difficult for a compression algorithm to beat th

        • Now we have LZO, an algorithm that has relatively poor compression

          No kidding. We compress trace files with LZO at my day job and the compressed versions still have large human readable chunks.

          • by shadanan (806810)
            That's because the LZO algorithm only attempts compression on data with runs of the same repeating data, and data that would match multiple instances of a sliding dictionary. Data that doesn't satisfy these conditions within a given block is not compressed.
      • by IkeTo (27776)

        If your OS isn't sharing duplicate memory blocks already, you're using a shitty OS. (Linux already shares dup read only blocks for many things, like most modern OSes).

        That depends on how the memory gets duplicated. If it is duplicated because it comes from the same library or because it is the result of forking a process, you're right, every OS does that. But if it is because the memory content comes from independent processes doing independent things and the result happens to be exactly the same, it is new. In the former case, if two processes are sharing some memory, one process decides to write over it, and the other does exactly the same write, then the result is t

      • by improfane (855034)

        The breakthrough nature of this is that the hypervisor or the host OS is providing a virtual machine to every guest OS in the system. A virtual machine provides an environment that mirrors the real hardware; the OS knows no better. This in theory means that you could run multiple Linux distributions with the memory of a Linux kernel only being used once, meaning more applications can be run within these guest OSes, or more guest OSes can be run.

        That's why it is impressive.

    • Re:First Post (Score:4, Interesting)

      by Anpheus (908711) on Saturday April 17, 2010 @06:24PM (#31884040)

      For now, at least. VMware doesn't support combining pages >= 2MB because of the overhead (hit rate on finding duplicates versus the cost of searching for duplicates), and I suspect other hypervisors will do the same. Additionally, Intel and AMD are both moving to support 1GB pages. What are the odds that you'll start up two VMs and their first 1GB of memory will remain identical for very long?

      The only way I see page sharing working in the future is if the hypervisor inspects the nested pages down to the VM level, which will typically be the 4KB pages we know and love. Either that, or paravirtualization support needs to exist for loading common code and objects into a shared pool.

      Even so, there's a lot of overhead from inspecting (hashing and then comparing) pages which will only grow as memory sizes grow. If we increase page sizes, the hit rate decreases and the overhead of copy-on-write increases. It's not a good situation.

      Sources: Performance Best Practices for vSphere 4 [vmware.com] which references Large Page Performance [vmware.com] which states:

      In ESX Server 3.5 and ESX Server 3i v3.5, large pages cannot be shared as copy-on-write pages. This means, the ESX Server page sharing technique might share less memory when large pages are used instead of small pages. In order to recover from non-sharable large pages, ESX Server uses a “share-before-swap” technique. When free machine memory is low and before swapping happens, the ESX Server kernel attempts to share identical small pages even if they are parts of large pages. As a result, the candidate large pages on the host machine are broken into small pages. In rare cases, you might experience performance issues with large pages. If this happens, you can disable large page support for the entire ESX Server host or for the individual virtual machine.

      That is, page sharing involves breaking up large pages, negating their performance benefit, and is only used as a last-ditch measure when you've overcommitted memory and you're nearly at the point of having to hit the disk. And VMware overcommit is great until you hit the disk; then it's a nightmare.
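
To make the "hash, then compare" cost and the page-size effect concrete, here is a toy sketch. It is not how ESX or KSM is implemented; the function `sharable_bytes`, the helper `toy_guest`, and the made-up memory layout are all assumptions for illustration. Each toy guest interleaves a page that both guests have in common with pages unique to that guest.

```python
import hashlib
from collections import defaultdict


def sharable_bytes(memory: bytes, page_size: int) -> int:
    """Return how many bytes could be freed by merging identical pages."""
    buckets = defaultdict(list)  # digest -> distinct page contents seen so far
    saved = 0
    for offset in range(0, len(memory) - page_size + 1, page_size):
        page = memory[offset:offset + page_size]
        digest = hashlib.sha1(page).digest()
        # The hash only narrows the candidates; a full byte-for-byte
        # comparison confirms that the pages really are identical.
        if any(page == seen for seen in buckets[digest]):
            saved += page_size
        else:
            buckets[digest].append(page)
    return saved


SMALL = 4096
COMMON = bytes([7]) * SMALL  # a page every toy guest happens to contain


def toy_guest(tag: int) -> bytes:
    """Four small pages: common, unique, common, unique."""
    return COMMON + bytes([tag]) * SMALL + COMMON + bytes([tag + 100]) * SMALL


memory = toy_guest(1) + toy_guest(2)
small_saving = sharable_bytes(memory, SMALL)      # scan at 4 KB granularity
large_saving = sharable_bytes(memory, 2 * SMALL)  # scan at 8 KB granularity

if __name__ == "__main__":
    print(small_saving, large_saving)
```

With this layout the 4 KB scan finds three redundant copies of the common page, while the 8 KB scan finds nothing, because every larger page drags a unique neighbour along with the common data.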

      • by x2A (858210)

        It'd be interesting to see actual speed differences of favouring larger page sizes vs fewer duplicate pages... on the one side, you get fewer TLB misses (which slow things down), but on the other side, you can - in effect - be increasing your L2 cache, depending on scheduling. By which I mean that if you have a page that's shared between three processes (VMs or otherwise) then any cachelines covering data within that space effectively covers three times the memory that it would have to cover if they weren't share

  • they are so informative now that the KernelTrap isn't updating regularly ... AHEM! *cough! *cough! there is some reorganization of linux graphics and networking that needs to be updated. (or has linux networking changed since 2005
    • by Anonymous Coward

      There are signs of life at KernelTrap (http://kerneltrap.org/ [kerneltrap.org]).

      There have been a number of postings by Jeremy since the beginning of April.

    • by ISoldat53 (977164)
      You should take something for that cough.
    • Re: (Score:1, Troll)

      by Runaway1956 (1322357)

      "or has linux networking changed since 2005"

      No, Mr. Ballmer - nothing has changed in Linux in years now. It's safe to stick your head back up your arse, and assume that Windows is superior to everything in the world. The schmucks out there still believe it, and sales are stable.

  • by WrongSizeGlass (838941) on Saturday April 17, 2010 @05:09PM (#31883664)
    ... why OSS is the way things should be. You'll never see this type of documentation, and this type of detail, available to anyone and everyone from closed source software. I love my Mac, and supporting Windows pays my bills, but OSS is unlike any other animal out there.
    • by siride (974284)
      There are books detailing the NT Kernel and the OS X kernel (which is open source, after all).
    • by abigor (540274) on Saturday April 17, 2010 @05:41PM (#31883828)

      OS X's kernel is open source (BSD license) and very well documented.

    • Re: (Score:3, Insightful)

      by Saint Stephen (19450)

      Take it from a guy who's seen the NT source code: Inside Windows 2000, the windows kernel debuggers, and a firewire cable gave you MORE than enough detail; there's not much important that's not publicly known.

      It just doesn't make Slashdot or the sites you frequent. How do you think Windows Device Driver writers do their job?

      • Re: (Score:2, Insightful)

        by Anonymous Coward

        How do you think Windows Device Driver writers do their job?

        Very badly if my experience is anything to go on.

    • by cjb110 (200521)

      Sorry, but Open != Detailed Documentation any more than Closed does. See Mark Russinovich's blog [technet.com] for way too much detail about the Windows kernel and ecosystem.
      I'd also argue the MSDN site is far more comprehensive and easy to use (if they'd stop pratting about with the colours and layout every other week) than any single source of linux docs (if there even is one)
      MS do seem to realise that you can't write docs assuming the reader knows what the doc is about, unlike most OSS documentation that assumes way t

  • by mehemiah (971799)
    since when did IBM care so much about Linux?
    • since when did IBM care so much about Linux?

      The way I see it, the core businesses for IBM are hardware and services. Anything that helps feed the two is a good investment for IBM.

  • If you are running 10 processes on 10 servers on one physical machine... isn't it easier and more efficient to run 100 processes on one instance of Linux?

    • by Chang (2714)

      In most cases where VM is useful the people who care about the 10 processes bring so much baggage in terms of demands that it pays big dividends to have the overhead of 10 machine images running in order to not have to listen to 10 people whining.

      There's IT theory and then there's IT reality...

    • Easier? Yes. More efficient? Yes. More secure from threats and bugs? Most likely not. 10 processes on 10 virtual servers means that if one process takes out the server, it takes out 9 other processes, not 99 other processes, unless it can actually manage to screw over the hypervisor, which is very well protected.

    • by Trepidity (597)

      Yes, which is what operating-system-level virtualization, which is basically an extension of the old concept of chroots or jails, is intended to do: give you many of the benefits of virtualization without the overhead of having multiple full copies of the OS running. It can also manage some resources better, e.g. having a unified filesystem cache. OpenVZ [openvz.org] is Linux's approach.

      However, full virtualization, like Xen, is somewhat more rock-solid in its separation of the virtual machines, and also allows more fle

    • by IkeTo (27776)

      In my case, it is because the people running those 10 servers want to have their freedom to set different kernel parameters, to install different OS packages, to run their own Apache server with their own hostname all looking at port 80, and to hold root account without fighting each other. Most importantly, top efficiency is not a concern as the servers are not heavily loaded.

      • by murdocj (543661)
        How does it work for multiple copies of Apache to all be looking at port 80? I mean, from the outside world, there can only be one port 80 at that IP address, right?
        • by chris mazuc (8017)

          Yes [apache.org]

          • by murdocj (543661)

            That article [apache.org] starts with "These scenarios are those involving multiple web sites running on a single server, via name-based or IP-based virtual hosts"

            It sounds like this is talking about how to configure a single instance of Apache to serve up different websites based on the incoming IP address or the web site domain name. It doesn't sound like it applies to running multiple virtual machines, each of which has its own copy of Apache, each of which is trying to listen to port 80.

            Although if I'm wrong, I'm

            • by chris mazuc (8017)

              I was replying to your comment, not the article:

              How does it work for multiple copies of Apache to all be looking at port 80? I mean, from the outside world, there can only be one port 80 at that IP address, right?

              Realistically you wouldn't have completely separate instances of Apache on the same machine, hence the virtualhosts stuff. When you said multiple copies of Apache I assumed you meant they would be on the same server because if they were on VMs it doesn't make sense to say "multiple copies of Apache trying to listen to port 80".

              It doesn't sound like it applies to running multiple virtual machines, each of which has its own copy of Apache, each of which is trying to listen to port 80.

              Your VMs would have separate IP addresses with one copy of Apache per VM. If that is a problem then make it NAT and put a proxy [wikipedia.org] in fron

        • How does it work for multiple copies of Apache to all be looking at port 80? I mean, from the outside world, there can only be one port 80 at that IP address, right?

          Each VM normally gets its own IP, distinct from all other VMs and the host.

        • by x2A (858210)

          The only difference is on network namespace separation. By default many things will listen on "port x" for IP address 0.0.0.0, which means, for example, Apache may eat up all port 80s for all IP addresses on the machine. Virtual machines get their own IP address and so you don't have to configure Apache to only listen on one, for other instances of Apache to be able to listen on their own. Somebody screwing up their configuration and accidentally listening on all port 80s won't stop the next person's Apache
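
The wildcard-address point can be poked at with plain sockets. This is a rough sketch, assuming Linux (where all of 127.0.0.0/8 is local, so 127.0.0.2 is bindable); the helper names `try_bind` and `demo` are invented here, and an unprivileged free port stands in for port 80.

```python
import socket


def try_bind(address, port):
    """Bind and listen on (address, port); return the socket, or None on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((address, port))
        sock.listen(1)
        return sock
    except OSError:
        sock.close()
        return None


def demo():
    """Return (second_specific_ok, wildcard_ok), or None if no socket could be bound."""
    first = try_bind("127.0.0.1", 0)  # port 0: let the kernel pick a free port
    if first is None:
        return None
    port = first.getsockname()[1]

    second = try_bind("127.0.0.2", port)   # different address, same port
    wildcard = try_bind("0.0.0.0", port)   # every address, same port
    result = (second is not None, wildcard is not None)

    for sock in (first, second, wildcard):
        if sock is not None:
            sock.close()
    return result


if __name__ == "__main__":
    print(demo())
```

Two listeners on distinct addresses can coexist on one port, while a wildcard listener collides with any listener already holding that port. Giving each VM its own IP address sidesteps the whole question.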

    • by drsmithy (35869)

      If you are running 10 processes on 10 servers on one physical machine... isn't it easier and more efficient to run 100 processes on one instance of Linux?

      That depends entirely on how you measure "easier" and "efficient".

    • No, really.

       

  • Kernel shared memory only acts on memory pages which it has been advised could be duplicates. This requires applications to specifically tag pages of memory as possibly being duplicates.

    Useful for virtualization (which is the primary purpose), but probably not actually functionally useful for more general memory de-duplication.

    • by mortonda (5175)

      Why don't you try reading the article.

      "What you'll soon discover is that although memory sharing in Linux is advantageous in virtualized environments (KSM was originally designed for use with the Kernel-based Virtual Machine [KVM]), it's also useful in non-virtualized environments. In fact, KSM was found to be beneficial even in embedded Linux systems, indicating the flexibility of the approach."

      • Maybe you should take your own advice.

        KSM relies on a higher-level application to provide instruction on which memory regions are candidates for merging. Although it would be possible for KSM to simply scan all anonymous pages in the system, it would be wasteful of CPU and memory (given the space necessary to manage the page-merging process). Therefore, applications can register the virtual areas that are likely to contain duplicate pages.

        • by mortonda (5175)

          But it can be applicable to non-virtualized apps. Furthermore, I see no reason that couldn't be "advisory" in nature, but still do a global scan if needed. The article didn't really say anything about that possibility.

          • Are you serious or are you a troll? The reason not to do a global scan is in the quote I just gave you from the article.

            Although it would be possible for KSM to simply scan all anonymous pages in the system, it would be wasteful of CPU and memory (given the space necessary to manage the page-merging process).

            You don't do a full memory scan because the red-black tree would more than double the memory usage during the scan.
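
Whether registered regions are actually being merged on a given box can be checked from KSM's sysfs counters. A small sketch follows, assuming the usual `/sys/kernel/mm/ksm` location; the function names `read_ksm_stats` and `pages_saved` are made up, and missing or unreadable files are simply skipped.

```python
from pathlib import Path

KSM_DIR = Path("/sys/kernel/mm/ksm")
COUNTERS = ("run", "pages_shared", "pages_sharing", "pages_unshared", "full_scans")


def read_ksm_stats(base=KSM_DIR):
    """Return {name: int} for the KSM counters that exist and are readable."""
    stats = {}
    for name in COUNTERS:
        try:
            stats[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            continue  # no KSM, no permission, or unexpected content
    return stats


def pages_saved(stats):
    """pages_sharing counts the extra mappings folded into shared pages."""
    return stats.get("pages_sharing", 0)


if __name__ == "__main__":
    stats = read_ksm_stats()
    print(stats or "KSM counters not available on this system")
    print("pages saved:", pages_saved(stats))
```

A `run` value of 0 means the scanner is stopped, in which case registered regions are never merged.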

  • Good on them. (Score:1, Insightful)

    by Anonymous Coward

    First off, as several people have said, IBM did VM over 40 years ago, hypervisors, full hardware virtualization, virtual memory, loads of failover type stuff since they are mainframes after all. Right now the modern wave of VMs are about where IBM was back then (with perhaps failover still being perfected.) IBM mainframe CPUs run the pipeline 2x, with a comparator in between the 2 copies to make sure everything matches. After 1 fault, it backs the pipeline up one and reruns it (and I'm sure logs a

  • "KSM allows the hypervisor to increase the number of concurrent virtual machines by consolidating identical memory pages."

    But first you have to waterboard it.

  • flying gristle (Score:1, Offtopic)

    by Mana Mana (16072)

    .

  • KSM is a great idea, and much of it is available in Fedora 12. I tried it, and to be honest I had higher expectations.

    That is not to say that it is no good. It's great, but there is a bit of cost analysis that should be done before implementing it. You don't get something for nothing, and in this case you're ultimately offloading the higher memory usage onto the CPU. Depending on your hypervisor setup this might not be such a bad thing, of course.

    In my somewhat narrow testing of it I found that:-

    a) Ev

  • Linux had this long ago.
    http://www.complang.tuwien.ac.at/ulrich/mergemem/ [tuwien.ac.at] - for example.

    Note that the savings referred to are on kernel 2.0.33.

    I used it on my 8M laptop - worked well.
