Anatomy of Linux Kernel Shared Memory
An anonymous reader sends in an IBM DeveloperWorks backgrounder on Kernel Shared Memory in the 2.6.32 Linux kernel. KSM allows the hypervisor to increase the number of concurrent virtual machines by consolidating identical memory pages. The article covers the ideas behind KSM (such as storage de-duplication), its implementation, and how you manage it.
First Post (Score:1, Redundant)
Re: (Score:2, Informative)
VMWare? Fuck, this has been around for decades in the case of OS/360.
A word of caution! (Score:2)
The use of KSM isn't without risk:
"A region can be removed from the mergeable state through the MADV_UNMERGEABLE parameter (which immediately un-merges any merged pages from that region). Note that removing a region of pages through madvise can result in an EAGAIN error, as the operation could run out of memory during the un-merge process, potentially leading to even greater trouble (out-of-memory conditions)."
KSM *will* overcommit the memory as it is designed to do so. And please also be mindful that Mur
Re:First Post (Score:5, Informative)
From the article:
"Going further
Linux is not alone in using page sharing to improve memory efficiency, but it is unique in its implementation as an operating system feature. VMware's ESX server hypervisor provides this feature under the name Transparent Page Sharing (TPS), while XEN calls it Memory CoW. But whatever the name or implementation, the feature provides better memory utilization, allowing the operating system (or hypervisor, in the case of KVM) to over-commit memory to support greater numbers of applications or VMs. You can find KSM—and many other interesting features—in the latest 2.6.32 Linux kernel."
Re: (Score:1, Interesting)
I know for a fact the concept isn't new to Windows, VMWare or FreeBSD (though none of them work exactly the same as this).
I would have presumed this wasn't new to Linux either, just different from the existing implementation (I know it's blasphemy here, but I'm not a Linux person).
It's certainly been done on mainframes for god knows how long.
I doubt this is as groundbreaking as it's being promoted.
If your OS isn't sharing duplicate memory blocks already, you're using a shitty OS. (Linux already shares dup
Re:First Post (Score:5, Insightful)
If your OS isn't sharing duplicate memory blocks already, you're using a shitty OS. (Linux already shares dup read only blocks for many things, like most modern OSes).
Umm, no.
Most modern OSes share memory for executable images and shared libraries. In addition, some OSes, such as Linux, support copy-on-write semantics for memory pages in child processes created with fork (note, Solaris is an example of an OS that *doesn't* do this).
Aside from that, there is no automated sharing of memory between processes. Frankly, I have no idea where you got the idea there was.
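The fork case is easy to poke at from userland. Here's a minimal sketch in Python, assuming a Unix-like system where `os.fork` exists. It only shows the visible semantics (the child's write never reaches the parent); whether the kernel gets there by copy-on-write or by an eager copy is invisible at this level. The name `fork_and_modify` is made up for the illustration.

```python
import os

def fork_and_modify(data):
    """Fork, let the child overwrite the buffer, and return
    (parent's view, child's view) of the buffer afterwards."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands in the child's own copy of the page(s).
        os.close(read_fd)
        data[:] = b"x" * len(data)
        os.write(write_fd, bytes(data))
        os.close(write_fd)
        os._exit(0)
    # Parent: collect what the child saw, then reap the child.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data), b"".join(chunks)

if __name__ == "__main__":
    parent_view, child_view = fork_and_modify(bytearray(b"shared until written"))
    print("parent:", parent_view)
    print("child: ", child_view)
```

Until the child writes, both processes can be backed by the same physical pages; the write is what forces a private copy.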
Re: (Score:2)
In addition, some OSes, such as Linux, support copy-on-write semantics for memory pages in child processes created with fork (note, Solaris is an example of an OS that *doesn't* do this).
Wait, Solaris doesn't do copy-on-write? (uh-oh)
I, too, expected that to be pretty much standard nowadays.
Re: (Score:3, Insightful)
This article [sun.com] claims that on Solaris, "fork() has been improved over the years to use the COW (copy-on-write) semantics". It's sort of an in-passing comment, though, and I can't find a definitive statement in docs anywhere (the Solaris fork() manpage [sun.com] doesn't give any details).
Re:First Post (Score:5, Insightful)
Aside from all the places that memory is shared between processes, there's no sharing between processes ... yeah, I totally get you ...
That's exactly right. I pointed out all the places. *All of them*. And there's *two*: shared, read-only executable pages, and the heaps of children created by COW-enabled forks. That's it. That's all.
So any new technology for memory de-duping is impressive because, traditionally, it just ain't done. Which directly contradicts the content of your original post.
Perhaps now you understand?
Re: (Score:3, Informative)
So any new technology for memory de-duping is impressive because, traditionally, it just ain't done. Which directly contradicts the content of your original post.
For those who are still confused, the big difference between the various shared-library-type schemes and memory de-duplication is passive vs. active.
Shared libraries (or executables) take advantage of the fact that when you load a program multiple times, the same bits are obviously being loaded each time, so it's just a reference count increment.
For memory de-duplication, during idle times, the hypervisor creates hashes of all the used memory pages and if any duplicates are found they are replaced with
Re: (Score:2)
In addition, some OSes, such as Linux, support copy-on-write semantics for memory pages in child processes created with fork (note, Solaris is an example of an OS that *doesn't* do this).
What year do you live in? Solaris _9_ had COW and multiple page size support, over half a decade ago. Linux large page size support is a joke, Solaris x86 even does it better on Linux's home turf.
http://www.sun.com/blueprints/0304/817-5917.pdf [sun.com]
Most modern OSes have a native fibre channel stack, except notably, Linux which doesn't have userland utilities for managing SCSI devices or even fibre channel drivers for that matter.
See what I did?
Re: (Score:2)
What year do you live in? Solaris _9_ had COW and multiple page size support, over half a decade ago.
*snicker* I like how you phrased that as "half a decade ago"... "five years ago" sounds far less impressive, when you consider how industrial strength Solaris has traditionally been considered. :) That said, I fully concede my information is probably out-of-date. Glad to see Solaris finally moved into the 21st century!
Besides, it wasn't a criticism of Solaris (the only reason I came across the factoid at
Re: (Score:1)
Solaris is an example of an OS that *doesn't* do this
Are you sure? When the child modifies its globals or anything in its heap, the memory has to be COWed.
Re: (Score:3, Interesting)
Personally, I find memory compression [lwn.net] more interesting than just deduplication, which could be considered a subset of compression. The idea has been around for years. There used to be a commercial product for Windows 95 [wikipedia.org] that claimed to compress the contents of RAM, but which had many serious problems, such as that the software didn't work, didn't even attempt compression, and was basically "fraudware". The main problem with the idea was it proved extremely difficult for a compression algorithm to beat th
Re: (Score:2)
Now we have LZO, an algorithm that has relatively poor compression
No kidding. We compress trace files with LZO at my day job and the compressed versions still have large human readable chunks.
Re: (Score:2)
If your OS isn't sharing duplicate memory blocks already, you're using a shitty OS. (Linux already shares dup read only blocks for many things, like most modern OSes).
That depends on how the memory gets duplicated. If it is duplicated because it comes from the same library or because it is the result of forking a process, you're right, every OS does that. But if it is because the memory content comes from independent processes doing independent things and the result happens to be exactly the same, it is new. In the former case, if two processes are sharing some memory, one process decides to write over it, and the other does exactly the same write, then the result is t
Re: (Score:2)
The breakthrough nature of this is that the hypervisor or the host OS is providing a virtual machine to every guest OS in the system. A virtual machine provides an environment that mirrors the real hardware; the OS knows no better. This in theory means that you could run multiple Linux distributions with the memory of a Linux kernel only being used once, meaning more applications can be run within these guest OSes, or more guest OSes can be run.
That's why it is impressive.
Re:First Post (Score:4, Interesting)
For now, at least. VMWare doesn't support combining pages >= 2MB because of the overhead (hit rate on finding duplicates versus the cost of searching for duplicates) and I suspect other hypervisors will do the same. Additionally, Intel and AMD are both moving to support 1GB page tables. What are the odds that you'll start up two VMs and their first 1GB of memory will remain identical for very long?
The only way I see page sharing working in the future is if the hypervisor inspects the nested pages down to the VM level, which will typically be the 4KB pages we know and love. Either that, or paravirtualization support needs to exist for loading common code and objects into a shared pool.
Even so, there's a lot of overhead from inspecting (hashing and then comparing) pages which will only grow as memory sizes grow. If we increase page sizes, the hit rate decreases and the overhead of copy-on-write increases. It's not a good situation.
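To make the "hashing and then comparing" cost concrete, here's a toy sketch in Python of one dedup pass over fixed-size pages. It is an illustration only, not how ESX or KSM actually implement it; `dedup_pages` and the 4096-byte `PAGE_SIZE` are assumptions for the example. Every page gets hashed, and a hash match is confirmed with a full byte compare before two pages are treated as one.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for the illustration

def dedup_pages(pages):
    """Return (store, mapping): `store` holds one copy of each distinct
    page, and mapping[i] is the index in `store` backing input page i."""
    store = []      # unique page contents
    by_hash = {}    # digest -> indices into `store` with that digest
    mapping = []
    for page in pages:
        digest = hashlib.sha256(page).digest()
        candidates = by_hash.setdefault(digest, [])
        for idx in candidates:
            # A hash match is only a hint; confirm with a full compare.
            if store[idx] == page:
                mapping.append(idx)
                break
        else:
            # No identical page seen yet: keep this one as a new copy.
            store.append(bytes(page))
            candidates.append(len(store) - 1)
            mapping.append(len(store) - 1)
    return store, mapping

if __name__ == "__main__":
    zero = bytes(PAGE_SIZE)
    pages = [zero, b"\x01" * PAGE_SIZE, zero, zero]
    store, mapping = dedup_pages(pages)
    print(len(pages), "pages backed by", len(store), "copies:", mapping)
```

Note the work scales with the total number of bytes scanned, not with the number of duplicates found, and a single differing byte anywhere in a page defeats the match. That is why larger pages hurt the hit rate.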
Sources: Performance Best Practices for vSphere 4 [vmware.com] which references Large Page Performance [vmware.com] which states:
That is, page sharing involves breaking up large pages, negating their performance benefit, and is only used as a last-ditch measure when you've overcommitted memory and you're nearly to the point of having to hit the disk. And VMWare overcommit is great until you hit the disk, then it's a nightmare.
Re: (Score:2)
Be interesting to see actual speed differences of favouring larger page sizes vs fewer duplicate pages... on the one side, you get fewer of the TLB misses that slow things down, but on the other side, you can - in effect - be increasing your L2 cache, depending on scheduling. By which I mean that if you have a page that's shared between three processes (VMs or otherwise) then any cachelines covering data within that space effectively covers three times the memory that it would have to cover if they weren't share
Re: (Score:3, Funny)
I start daemons myself, you insensitive clod!
i love these (Score:1)
Re:i love these -- but KernelTrap is back! (Score:1, Interesting)
There are signs of life at KernelTrap (http://kerneltrap.org/ [kerneltrap.org]).
There have been a number of postings by Jeremy since the beginning of April.
Re: (Score:1, Troll)
"or has linux networking changed since 2005"
No, Mr. Ballmer - nothing has changed in Linux in years now. It's safe to stick your head back up your arse, and assume that Windows is superior to everything in the world. The schmucks out there still believe it, and sales are stable.
This Is Just One Reason ... (Score:5, Interesting)
Re: (Score:1)
I never understand why clueless people resort to making up facts to defend their beliefs/leaders/country/software
Physician heal thyself.
A newbie Linux programmer is pretty much fucked if he doesn't have years of prior experience.
A newbie Linux programmer is not likely to be mucking around in the kernel now is he?
Re: (Score:3, Insightful)
far better documentation from MS and Apple than Linux has ever had.
Have you ever looked at Amazon or InformIT/Safari or any technical documentation vendor or website? There are enough books and articles on MS and Linux to keep you reading for many lifetimes (Apple not so much, but still plenty by my estimate). It's just a FACT that there are certain things that closed source vendors do not disclose as a matter of trade secret or intellectual property, which is what I believe WrongSizeGlass was referring to. OSS does not have this limitation.
And no, the source doesn't count if no one knows what you intended to do
It absolutely does count, if you
Re: (Score:1)
Common sense please, no matter your expertise in programming, you can't understand some code unless documented. Example: the X Server.
Re: (Score:1)
In all my years, I've never once told my clients "I can't understand it. It's just too hard". Consider yourself lucky if you have the code. I've even had to reverse engineer code from binaries on occasion. That's hard, but again, not impossible because if the CPU can execute it, it
Re: (Score:1)
Documentation from the vendor or project shows that they care about the details.
No, it shows that if they want to sell their product (or in the case of OSS, achieve wide acceptance), they'd better have decent docs. Corporations don't care at all. If not having decent documentation did not impact the bottom line, do you really think these corporations would expend the significant resources they do to produce the documentation?
In any case, many times the best information available (with the possible exception of reference API docs) comes from third party sources such as books and article
Re: (Score:2)
"it shows that if they want to sell their product [snip] Corporations don't care at all"
Err, I think that's actually just breaking down into a language argument now, based on the fuzziness of what the word 'care' is perceived to mean. Ya know, do you 'care' about how much money you have in the bank or do you just 'care' about how much you can do of what you want to do with how much you have? Is the idea that something is just a means to an end sufficient to say that one doesn't truly care about the means b
Re: (Score:2)
What OS are you using? It's not Mac OS X or Windows; both have OSS components.
Re:This Is Just One Reason ... (Score:5, Informative)
OS X's kernel is open source (BSD license) and very well documented.
Re: (Score:3, Insightful)
Take it from a guy who's seen the NT source code: Inside Windows 2000, the windows kernel debuggers, and a firewire cable gave you MORE than enough detail; there's not much important that's not publicly known.
It just doesn't make Slashdot or the sites you frequent. How do you think Windows Device Driver writers do their job?
Re: (Score:2, Insightful)
How do you think Windows Device Driver writers do their job?
Very badly if my experience is anything to go on.
Re: (Score:1)
Sorry, but Open != Detailed Documentation any more than Closed does. See Mark Russinovich's blog [technet.com] for way too much detail about the Windows kernel and ecosystem.
I'd also argue the MSDN site is far more comprehensive and easy to use (if they'd stop pratting about with the colours and layout every other week) than any single source of linux docs (if there even is one)
MS do seem to realise that you can't write docs assuming the reader knows what the doc is about, unlike most OSS documentation that assumes way t
IBM (Score:1)
Re: (Score:2)
since when did IBM care so much about Linux?
The way I see it, the core businesses for IBM are hardware and services. Anything that helps feed the two is a good investment for IBM.
How does virtualization help (Score:2)
If you are running 10 processes on 10 servers on one physical machine... isn't it easier and more efficient to run 100 processes on one instance of Linux?
Re: (Score:2)
In most cases where VM is useful the people who care about the 10 processes bring so much baggage in terms of demands that it pays big dividends to have the overhead of 10 machine images running in order to not have to listen to 10 people whining.
There's IT theory and then there's IT reality...
Re: (Score:2)
Easier? Yes. More efficient? Yes. More secure from threats and bugs? Most likely not. 10 processes on 10 virtual servers means that if one process takes out the server, it takes out 9 other processes, not 99 other processes, unless it can actually manage to screw over the hypervisor, which is very well protected.
Re: (Score:2)
Yes, which is what operating-system-level virtualization, which is basically an extension of the old concept of chroots or jails, is intended to do: give you many of the benefits of virtualization without the overhead of having multiple full copies of the OS running. It can also manage some resources better, e.g. having a unified filesystem cache. OpenVZ [openvz.org] is Linux's approach.
However, full virtualization, like Xen, is somewhat more rock-solid in its separation of the virtual machines, and also allows more fle
Re: (Score:2)
In my case, it is because the people running those 10 servers want to have their freedom to set different kernel parameters, to install different OS packages, to run their own Apache server with their own hostname all looking at port 80, and to hold root account without fighting each other. Most importantly, top efficiency is not a concern as the servers are not heavily loaded.
Re: (Score:2)
Yes [apache.org]
Re: (Score:2)
That article [apache.org] starts with "These scenarios are those involving multiple web sites running on a single server, via name-based or IP-based virtual hosts"
It sounds like this is talking about how to configure a single instance of Apache to serve up different websites based on the incoming IP address or the web site domain name. It doesn't sound like it applies to running multiple virtual machines, each of which has its own copy of Apache, each of which is trying to listen to port 80.
Although if I'm wrong, I'm
Re: (Score:2)
I was replying to your comment, not the article:
How does it work for multiple copies of Apache to all be looking at port 80? I mean, from the outside world, there can only be one port 80 at that IP address, right?
Realistically you wouldn't have completely separate instances of Apache on the same machine, hence the virtualhosts stuff. When you said multiple copies of Apache I assumed you meant they would be on the same server because if they were on VMs it doesn't make sense to say "multiple copies of Apache trying to listen to port 80".
It doesn't sound like it applies to running multiple virtual machines, each of which has its own copy of Apache, each of which is trying to listen to port 80.
Your VMs would have separate IP addresses with one copy of Apache per VM. If that is a problem then make it NAT and put a proxy [wikipedia.org] in fron
Re: (Score:2)
How does it work for multiple copies of Apache to all be looking at port 80? I mean, from the outside world, there can only be one port 80 at that IP address, right?
Each VM normally gets its own IP, distinct from all other VMs and the host.
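For anyone still puzzled by "only one port 80": a listener is keyed on the (address, port) pair, not on the port alone. Here's a small sketch in Python, assuming Linux, where all of 127.0.0.0/8 is routed to loopback so 127.0.0.2 can stand in for "another machine's IP". The helper names are made up for the illustration, and it uses an OS-chosen port rather than 80 so it doesn't need root.

```python
import errno
import socket

def bind_listener(addr, port):
    """Bind a TCP listener to (addr, port); port 0 lets the OS choose."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((addr, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock

def demo():
    """Return (port, errno of the wildcard bind, or None if it succeeded)."""
    first = bind_listener("127.0.0.1", 0)
    port = first.getsockname()[1]
    # Same port number, different address: no conflict.
    second = bind_listener("127.0.0.2", port)
    try:
        # The wildcard address overlaps both listeners above.
        third = bind_listener("0.0.0.0", port)
    except OSError as exc:
        wildcard_errno = exc.errno
    else:
        wildcard_errno = None
        third.close()
    finally:
        first.close()
        second.close()
    return port, wildcard_errno

if __name__ == "__main__":
    port, err = demo()
    print("port", port, "wildcard bind refused:", err == errno.EADDRINUSE)
```

Two servers on distinct addresses coexist on the same port; it's the wildcard bind that collides. With one IP per VM, each guest's Apache can take port 80 without knowing the others exist.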
Re: (Score:2)
The only difference is in network namespace separation. By default many things will listen on "port x" for IP address 0.0.0.0, which means, for example, Apache may eat up all port 80s for all IP addresses on the machine. Virtual machines get their own IP address and so you don't have to configure Apache to only listen on one, for other instances of Apache to be able to listen on their own. Somebody screwing up their configuration and accidentally listening on all port 80s won't stop the next person's Apache
Re: (Score:2)
If you are running 10 processes on 10 servers on one physical machine... isn't it easier and more efficient to run 100 processes on one instance of Linux?
That depends entirely on how you measure "easier" and "efficient".
Politics (Score:2)
No, really.
Doesn't necessarily do anything (Score:2)
Kernel shared memory only acts on memory pages which it has been advised could be duplicates. This requires applications to specifically tag pages of memory as possibly being duplicates.
Useful for virtualization (which is the primary purpose), but probably not actually functionally useful for more general memory de-duplication.
Re: (Score:2)
Why don't you try reading the article.
"What you'll soon discover is that although memory sharing in Linux is advantageous in virtualized environments (KSM was originally designed for use with the Kernel-based Virtual Machine [KVM]), it's also useful in non-virtualized environments. In fact, KSM was found to be beneficial even in embedded Linux systems, indicating the flexibility of the approach."
Re: (Score:2)
Maybe you should take your own advice.
KSM relies on a higher-level application to provide instruction on which memory regions are candidates for merging. Although it would be possible for KSM to simply scan all anonymous pages in the system, it would be wasteful of CPU and memory (given the space necessary to manage the page-merging process). Therefore, applications can register the virtual areas that are likely to contain duplicate pages.
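In practice "registering" a region means calling madvise on it with MADV_MERGEABLE. Here's a hedged sketch in Python, assuming Linux and Python 3.8 or later (where `mmap.madvise` and, on supporting systems, `mmap.MADV_MERGEABLE` exist). The function names are made up for the example, and the sysfs files under /sys/kernel/mm/ksm are read only if present. On a kernel built without KSM the madvise call is expected to fail, so the sketch reports that instead of raising.

```python
import mmap
import os

def mark_mergeable(length):
    """Map `length` bytes of private anonymous memory and ask the kernel
    to consider it for KSM merging. Returns (region, accepted)."""
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None or not hasattr(region, "madvise"):
        return region, False   # platform or Python too old for this
    try:
        region.madvise(advice)
    except OSError:
        return region, False   # e.g. kernel built without KSM
    return region, True

def ksm_counters(base="/sys/kernel/mm/ksm"):
    """Read a few KSM sysfs counters; None where a file is unavailable."""
    counters = {}
    for name in ("run", "pages_shared", "pages_sharing"):
        try:
            with open(os.path.join(base, name)) as handle:
                counters[name] = int(handle.read().strip())
        except (OSError, ValueError):
            counters[name] = None
    return counters

if __name__ == "__main__":
    region, accepted = mark_mergeable(mmap.PAGESIZE * 16)
    print("kernel accepted MADV_MERGEABLE:", accepted)
    print(ksm_counters())
    region.close()
```

Even when the advice is accepted, nothing is merged unless the ksmd scanner is running (the `run` file set to 1) and it actually finds identical pages, which is the point of the subject line above.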
Re: (Score:2)
But it can be applicable to non-virtualized apps. Furthermore, I see no reason that couldn't be "advisory" in nature, but still do a global scan if needed. The article didn't really say anything about that possibility.
Re: (Score:2)
Are you serious or are you a troll? The reason not to do a global scan is in the quote I just gave you from the article.
Although it would be possible for KSM to simply scan all anonymous pages in the system, it would be wasteful of CPU and memory (given the space necessary to manage the page-merging process).
You don't do a full memory scan because the red-black tree would more than double the memory usage during the scan.
Good on them. (Score:1, Insightful)
First off, as several people have said, IBM did VM over 40 years ago, hypervisors, full hardware virtualization, virtual memory, loads of failover type stuff since they are mainframes after all. Right now the modern wave of VMs are about where IBM was back then (with perhaps failover still being perfected.) IBM mainframe CPUs run the pipeline 2x, with a comparator in between the 2 copies to make sure everything matches. After 1 fault, it backs the pipeline up one and reruns it (and I'm sure logs a
KSM (Score:2)
"KSM allows the hypervisor to increase the number of concurrent virtual machines by consolidating identical memory pages."
But first you have to waterboard it.
flying gristle (Score:1, Offtopic)
.
I've tried it and I was disappointed. (Score:1)
KSM is a great idea, and many of its abilities are available in Fedora 12. I tried it and I had higher expectations, to be honest.
That is not to say that it is no good - it's great, but there is a bit of a cost analysis that should be done before implementing it. You don't get something for nothing - and in this case ultimately you're offloading the higher memory usage onto the CPU. Depending on your hypervisor setup this might not be such a bad thing, of course.
In my somewhat narrow testing of it I found that:-
a) Ev
Old idea is old. (Score:2)
Linux had this long ago.
http://www.complang.tuwien.ac.at/ulrich/mergemem/ [tuwien.ac.at] - for example.
Note that the savings referred to are on kernel 2.0.33.
I used it on my 8M laptop - worked well.