
Red Hat & AMD Demo Live VM Migration Across CPU Vendors 134

An anonymous reader notes an Inquirer story reporting on something of a breakthrough in virtual machine management: a demonstration (not yet a product) of migrating a running virtual machine across CPUs from different vendors (video here). "Red Hat and AMD have just done the so-called impossible, and demonstrated VM live migration across CPU architectures. Not only that, they have demonstrated it across CPU vendors, potentially commoditizing server processors. This is quite a feat. Only a few months ago during VMworld, Intel and VMware claimed that this was impossible. Judging by an initial response, VMware is quite irked by this KVM accomplishment and they are pointing to stability concerns. This sounds like scaremongering to me ... All the interesting controversy aside, cross-vendor migration is [obviously] a good thing for customers because it avoids platform lock-in."
This discussion has been archived. No new comments can be posted.

  • by stabe ( 1133453 ) on Friday November 07, 2008 @12:29PM (#25676659)
    Xen has supported this feature since Xen 3.3; it is called CPUID: http://www.nabble.com/Xen-3.3-News:-3.3.0-release-available!-td19106008.html [nabble.com]. No real breakthrough here...
  • Re:Bravo! (Score:4, Informative)

    by 2names ( 531755 ) on Friday November 07, 2008 @12:45PM (#25676849)
    We have certainly come a long way when a Cornwallis supports freedom of the people. :)
  • by LinuxGeek ( 6139 ) * <djand...nc@@@gmail...com> on Friday November 07, 2008 @01:04PM (#25677085)

    This is a demo of a Live migration, no shutdown or reboot involved. Xen does not support the live migration of a running VM between an AMD and Intel server. Watch the video, they are running a video in the VM that keeps playing during the migration. Very impressive stuff.

  • by michrech ( 468134 ) on Friday November 07, 2008 @01:48PM (#25677507)

    It didn't seem that interesting to me. If you watch the video, the Intel and Barcelona machines showed no VMs running (0% load). When the Shanghai server took over the load, *of course* its load line will rise -- it's the only server running a VM at that point!

    There are no shenanigans going on here, and I don't think this says anything about the chips as you imply, either.

  • Re:Umm... (Score:3, Informative)

    by TheRaven64 ( 641858 ) on Friday November 07, 2008 @02:01PM (#25677649) Journal

    On most other setups you'd have to shut the VM down and then restart it on the other machine for it to work correctly

    Do you? I first saw Xen demo live migration in 2005, and I don't think it was new then. Their demo had a Quake server being thrown around a cluster without clients noticing. Downtime was well under 100ms. You can read the paper [cam.ac.uk] for more information.

    They were claiming that you can move between processor types, but they didn't specify how different they could be. If it's just a matter of SSE or 3DNow! support disappearing then that's not a hard problem - just trap-and-emulate any of the old instructions. Relaunching programs that use these will cause the new values of CPUID to be picked up.

  • by kscguru ( 551278 ) on Friday November 07, 2008 @02:09PM (#25677777)
    Yet Another VMware engineer here.

    The new Intel/AMD CPU features that allow masking of CPUID bits while running virtualized also make processors recent enough that most of the interesting features are present - MMX, SSE up to ~3. The "common subset" ends up looking like an early Core2 or a Barcelona (minus the VT/SVM feature bits, of course) - Intel and AMD run about a generation behind on adding each other's instructions. Run on anything older than the latest processors, and you have to trap-and-emulate every CPUID instruction. Enough code still uses CPUID as a serializing instruction that this has noticeable overhead.

    So there are two strategies. Pass directly through the CPUID bits (and on the newest processors, apply a mask), or remember a baseline value, trap-and-emulate every CPUID and always return that value. Sounds like KVM has picked the latter approach for a default; VMware's default is to expose the actual processor features and accept a mask as an optional override, which skews towards exposing more features at the expense of some compatibility. Equally valid choices, IMHO.

    The Worst Case Scenario when not doing a trap-and-emulate of every CPUID is an app that does CPUID, reads the vendor string, then decides based on the vendor string which other CPUID leafs to read. (Like the 0x80000000 leafs, which are vendor-specific and would come back as gibberish if you get the processor wrong). If the app migrates during the dozen or so instructions between the first CPUID and the following ones, instant corruption. Good enough for a pretty demo, destined to make a guest kernel die a few times a year if actually used in production. And I'm 95% sure this is what the OP demo is doing - living dangerously by hoping mismatched CPUID results never get noticed.

    I agree with Anthony Liguori here - on a production machine, an Intel/AMD migration is way too much of a stupid risk. All you have to do is reboot the VM, it's much safer.

    (As a side note to everyone reading, the reason Linux timekeeping is such a problem is that TSC issue. Intel long ago stated TSC was NOT supposed to be used as a timesource. Linux kernel folks ignored the warning, made non-virtualizable assumptions, and today are in a world of hurt for timekeeping in a VM. And only now, many years later, are patching the kernel to detect hypervisors to work around the problem.)
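
The worst-case scenario described in this comment (read the vendor string, migrate, then read a vendor-specific leaf) can be sketched as a toy model. This is only an illustration under stated assumptions: the host names, the `VM` class, the `probe` helper, and the table contents are all hypothetical placeholders, not real CPUID register values or any hypervisor's API.

```python
# Hypothetical per-host CPUID tables. The strings are placeholders, not
# real register contents.
HOSTS = {
    "intel_host": {"vendor": "GenuineIntel", "ext_leaf": "intel-specific-data"},
    "amd_host": {"vendor": "AuthenticAMD", "ext_leaf": "amd-specific-data"},
}

# Baseline remembered when the VM was created (assumed here to be the
# Intel host's view). Used only when every CPUID is trapped and emulated.
BASELINE = dict(HOSTS["intel_host"])


class VM:
    """Toy guest that can run on either host."""

    def __init__(self, host, trap_cpuid):
        self.host = host
        self.trap_cpuid = trap_cpuid

    def cpuid(self, field):
        # Trap-and-emulate: always answer from the remembered baseline.
        # Pass-through: answer from whatever host we are on right now.
        table = BASELINE if self.trap_cpuid else HOSTS[self.host]
        return table[field]

    def migrate(self, new_host):
        self.host = new_host


def probe(vm, migrate_midway):
    """Guest code: read the vendor, then a vendor-specific leaf."""
    vendor = vm.cpuid("vendor")
    if migrate_midway:
        vm.migrate("amd_host")  # migration lands between the two reads
    ext = vm.cpuid("ext_leaf")
    return vendor, ext


def is_consistent(vendor, ext):
    """True if both answers could have come from one real host."""
    return any(
        t["vendor"] == vendor and t["ext_leaf"] == ext for t in HOSTS.values()
    )
```

With pass-through, `probe(VM("intel_host", trap_cpuid=False), migrate_midway=True)` pairs an Intel vendor string with AMD leaf data, which is the mismatch the comment warns about. With `trap_cpuid=True` the guest sees the same baseline before and after the move.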

  • by TheRaven64 ( 641858 ) on Friday November 07, 2008 @02:09PM (#25677793) Journal
    Actually, it is suspended, but only for a fraction of a second. First you copy the entire contents of memory to the new machine and mark it as read-only. Each page fault caused by this is used to mark pages that are still dirty. Then you copy these. You keep repeating this process until the set of dirty pages is very small. Then you suspend the VM, copy the dirty pages, and start the VM on the new machine. Userspace programs will just notice that they went an unusually long time without their scheduling quantum. With Xen, at least, the kernel is responsible for bringing up and shutting down all CPUs except the first one, so the kernel will notice the migration (in a paravirtualised kernel - with HVM it won't) and restart the other (virtual) CPUs.
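
The iterative copy loop described above can be simulated in a few lines. This is a rough sketch, not Xen's implementation: the function name `precopy_migrate`, the `dirty_rate` model (the guest re-dirties a fixed fraction of the pages sent in each round), and the `threshold` and `max_rounds` cut-offs are all assumptions made for illustration.

```python
import random


def precopy_migrate(num_pages, dirty_rate, threshold=8, max_rounds=30, seed=0):
    """Simulate pre-copy live migration.

    Returns (rounds, pages_sent_while_running, pages_sent_while_suspended).
    """
    rng = random.Random(seed)
    dirty = set(range(num_pages))  # first pass: every page must be copied
    sent_live = 0
    rounds = 0

    # Keep copying while the guest runs, until the dirty set is small
    # (or we give up because the guest dirties pages too quickly).
    while len(dirty) > threshold and rounds < max_rounds:
        to_send = len(dirty)
        sent_live += to_send
        # Assumed model: while those pages were in flight, the guest wrote
        # to a number of pages proportional to the copy time.
        redirtied = min(num_pages, int(to_send * dirty_rate))
        dirty = set(rng.sample(range(num_pages), redirtied))
        rounds += 1

    # Stop-and-copy: suspend the VM, send the last dirty pages, resume on
    # the destination. This is the only part the guest experiences as a pause.
    sent_paused = len(dirty)
    return rounds, sent_live, sent_paused
```

For example, `precopy_migrate(1024, 0.25)` sends 1024, 256, 64, and 16 pages in four live rounds and leaves only 4 pages for the suspended phase. With `dirty_rate=1.0` the dirty set never shrinks, so the loop stops at `max_rounds` and the final pause has to carry every page.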
  • by nabsltd ( 1313397 ) on Friday November 07, 2008 @03:11PM (#25678987)

    VMware doesn't require "identical" hardware to do live migration, either.

    It does have to be similar enough, which at this point pretty much means just the same processor manufacturer. As long as the processor supports hardware virtualization, VMware will allow you to set up a cluster that will allow live migration with no issues.

  • Re:Umm... (Score:3, Informative)

    by nabsltd ( 1313397 ) on Friday November 07, 2008 @03:17PM (#25679085)

    And, when you think about it, any instruction that you would have to trap if the VM used to be running on a different processor must be trapped at all times.

    This is because you have no way of knowing which processor type the VM was first started on. When this happened, it's likely the OS did some hardware checking and figured out which instructions it could (and could not) use. Moving the VM isn't going to change what the OS believes is the processor, and that's the problem.

    Overall, VMware's Enhanced VMotion Compatibility method of lying to the OS about the capabilities of the processor seems to be the easiest way of doing this. But, they only do it within one CPU manufacturer, because otherwise you'd end up with a very low-featured virtual processor.
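
The "very low-featured virtual processor" point follows from how a safe baseline has to be chosen: the guest may only be told about features that every host in the cluster has, which is the intersection of the hosts' feature sets. A minimal sketch, assuming hypothetical host names and illustrative feature lists (these are not real CPUID dumps of any particular chip, and `common_baseline` and `advertised_features` are made-up helper names):

```python
def common_baseline(host_features):
    """Features present on every host; safe to advertise to a guest."""
    return frozenset.intersection(*host_features.values())


def advertised_features(host_flags, baseline=None):
    """What the guest is told.

    baseline=None models pass-through (expose the real host flags);
    otherwise the host flags are masked down to the baseline.
    """
    return host_flags if baseline is None else host_flags & baseline


# Illustrative feature sets only.
cluster = {
    "host_a": frozenset({"mmx", "sse", "sse2", "sse3", "ssse3"}),
    "host_b": frozenset({"mmx", "sse", "sse2", "sse3", "3dnow"}),
}

baseline = common_baseline(cluster)
```

Here `baseline` contains only `mmx`, `sse`, `sse2`, and `sse3`. Each host's own extras are hidden, so a guest that checked its features at boot sees the same answer wherever it lands. The more the hosts differ, the smaller that intersection gets.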

  • by Anthony Liguori ( 820979 ) on Friday November 07, 2008 @03:34PM (#25679443) Homepage

    The new Intel/AMD CPU features that allow masking of CPUID bits while running virtualized also make processors recent enough that most of the interesting features are present - MMX, SSE up to ~3. The "common subset" ends up looking like an early Core2 or a Barcelona (minus the VT/SVM feature bits, of course) - Intel and AMD run about a generation behind on adding each other's instructions. Run on anything older than the latest processors, and you have to trap-and-emulate every CPUID instruction. Enough code still uses CPUID as a serializing instruction that this has noticeable overhead.

    Modern OSes do not use CPUID for serialization. We trap CPUID unconditionally in KVM and have not observed a performance problem because of it. Older OSes did this but I'm not aware of a modern one.

    My understanding is that the recent CPUID "masking" support exists because, if you are not using VT/SVM (Xen PV or VMware JIT), there is no way to trap CPUID when it's executed from userspace. AMD just happened to have this feature so when Intel announced "FlexMigration", they were able to just document it. I don't think it's really all that useful though.

    (As a side note to everyone reading, the reason Linux timekeeping is such a problem is that TSC issue. Intel long ago stated TSC was NOT supposed to be used as a timesource. Linux kernel folks ignored the warning, made non-virtualizable assumptions, and today are in a world of hurt for timekeeping in a VM. And only now, many years later, are patching the kernel to detect hypervisors to work around the problem.)

    The TSC is often used as a secondary time source, even outside of Linux, but yes, Linux is the major problem. But Windows is not without its own faults wrt timekeeping. Dealing with missed timer ticks for Windows guests is a never-ending source of joy. Virtualization isn't the only source of problems here. Certain hardware platforms have had overzealous SMM routines and the result was really bad time drift when running Windows.

  • Re:Bravo! (Score:1, Informative)

    by Anonymous Coward on Friday November 07, 2008 @04:37PM (#25680563)

    With Obama at the helm, you may not have guns to protect your liberty, so death is more likely. ;)

  • by Anonymous Coward on Friday November 07, 2008 @07:41PM (#25683279)

    True, it is higher but the guy mentions each server is running several VMs (each of which could be doing stuff), not just the one. Also the scale of time isn't visible from the start of migration until finish. Not sure it shows anything really but well spotted.
