UNIX Process Cryogenics?

shawarma asks: "Due to a recent power outage, I've had to shut down a server running a process that had been running for ages calculating something. The job it was doing would have been done in a few days, I think, but I had to shut it down before the UPS ran out of juice. This got me thinking: Why can't I freeze the process and thaw it back out at a later time? It ought to be possible to take all the connected memory pages and save them in some way, preserving file handles and pointers and everything. Maybe net connections would die, but that's understandable. Has any work been done in this field? If not, shouldn't there be? I'd like to contribute in some way, but I think it's a bit over my head." Laptops have been doing this in some form for years: most laptops, when they run out of power or when told to by the user, will go into "suspend" mode, which is similar to what the poster is describing; however, outside of laptops, I haven't seen this done. Sleeping processes also do something similar, sending their memory pages into swap so other running processes can use the memory. What, if anything, is preventing someone from taking this a step further?
  • is not suspend, it is hibernate. Suspend will power down the computer except for the energy needed to keep the RAM alive. Hibernate will save all data from memory to disk. I, personally, use neither on my laptop.
    • What you refer to as suspend is what most people (and APM) call standby. What you call hibernate is what APM refers to as suspend. I believe Windows uses the term hibernate to refer to a software suspend function.
  • Of course, you could write your application so that it saves state at regular intervals (aka checkpointing). Especially with calculations you should be able to store intermediate results.
    • Easier said than done. If this wasn't part of the application's design, or if the application is relatively sophisticated, making these changes can be non-trivial. And (shock/horror) if you don't have the source code, it's impossible without OS assistance.
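      A minimal Python sketch of this kind of application-level checkpointing (the file name, state layout, and loop body are all illustrative, not taken from any particular tool):

      import os
      import pickle

      CHECKPOINT = "job.ckpt"

      def save_checkpoint(state):
          # Write to a temp file, then rename into place: rename is atomic
          # on POSIX, so a power failure never leaves a half-written file.
          tmp = CHECKPOINT + ".tmp"
          with open(tmp, "wb") as f:
              pickle.dump(state, f)
              f.flush()
              os.fsync(f.fileno())   # make sure the bytes really hit the disk
          os.rename(tmp, CHECKPOINT)

      state = {"i": 0, "total": 0.0}
      while state["i"] < 10000000:
          state["total"] += 1.0 / (state["i"] + 1)   # stand-in for real work
          state["i"] += 1
          if state["i"] % 100000 == 0:
              save_checkpoint(state)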
  • by interiot ( 50685 ) on Friday January 25, 2002 @02:02PM (#2902115) Homepage
    External dependencies might include open files (what if you freeze, and then delete the file?), open TCP sockets to daemons elsewhere that wouldn't get frozen, subprocesses, etc... These would probably have to be revived, but how?
  • We do it in Condor (Score:5, Informative)

    by epaulson ( 7983 ) on Friday January 25, 2002 @02:02PM (#2902118) Homepage
    http://www.cs.wisc.edu/condor/

    Free-as-in-beer, on most major UNIX platforms. Check out our publications, we have several that give all the details you'd need to write it yourself.

    Plenty of others, too - libckpt, there was a "Checkpointing Threaded Programs" paper at USENIX this past summer... there are some kernel patches that can do it, most of them under the GPL.
    • by dsouth ( 241949 ) on Friday January 25, 2002 @03:19PM (#2902796) Homepage

      As the poster said, there are plenty of others:

      • SGI IRIX [sgi.com] and Cray UNICOS [cray.com] provide kernel-level checkpoint-restart.
      • Condor [wisc.edu] provides user-level checkpoint restart and process migration by manipulating libraries at runtime.
      • esky [anu.edu.au] provides user-level checkpoint restart under Solaris and Linux via runtime library manipulation.
      • crak [columbia.edu] provides kernel-level checkpoint restart for linux.
      • cocheck [tum.edu] provides user-level checkpoint-restart.
      • libckpt [utk.edu] provides user-level checkpoint-restart.


      I'm sure I left several out. Checkpoint-restart has been part of the high-performance computing scene for years. Having been a sysadmin on large high-performance computing platforms for the last few years of my professional life, my experiences with checkpoint-restart have been a mixed bag. All of the existing systems have limitations. Depending on the application, those limitations can be no problem, or they can be deal-breakers.
  • by kilgore_47 ( 262118 ) <kilgore_47 AT yahoo DOT com> on Friday January 25, 2002 @02:02PM (#2902121) Homepage Journal
    for the "Classic" environment. It seems so stupid watching macos9 boot up in a window when you want to use a classic program; Apple ought to save the state of the classic environment in to a file that could be quickly reloaded into ram when classic is called for. As the blurb said, laptops have had the suspend feature for years; would it really be so hard to apply the same concept elsewhere?
    • Well, OS X certainly can sleep (both OS X and Classic go to sleep), putting to sleep also all processes. As to hibernating the Classic environment, I don't know how useful that would really be in the long run.
      • by ncc74656 ( 45571 ) <scott@alfter.us> on Friday January 25, 2002 @04:39PM (#2903433) Homepage Journal
        Well, OS X certainly can sleep (both OS X and Classic go to sleep), putting to sleep also all processes. As to hibernating the Classic environment, I don't know how useful that would really be in the long run.

        I don't know how directly comparable this example might be, but I used to use VMware (under Linux) to suspend Win98 when I didn't need it. If I needed to do something under Win98 (like browse the web), VMware would load up Win98 where I last left it. It saved the minute or so of waiting for the VM to POST and load Win98.

        (If VMware provided better support for DirectX, I might not have needed to switch my home workstation from Linux to Win2K. It's been more than a year since I checked, though, so things might've improved.)

    • Errrr... Without protected memory spaces, I _don't_ think that this is what you want. You'd actually be setting yourself up for more problems. You don't want to save the system's memory state unless you can be sure that it's relatively clean & safe...
      • I think what he means is save the clean boot-up state of the classic environment (provided nothing has changed in the System folder since the last boot of classic). That way when classic needs to boot, OS X could just throw up a booted classic environment memory state in a matter of seconds instead of booting classic from scratch each time.

        - j
        • You'd have to define what you mean by "nothing has changed in the System folder", since prefs, for example, can change all the time. I suppose if you checked the image against the latest modification time of all files in the system folder, and threw away the image if the image was older than any file, it would work, but it seems that it could be pretty time consuming to do.
    • Which is funny, because VMware has exactly this capability.

      It needs some refinement, and sometimes it's slow when it picks back up again, but it generally works in my experience. It is obviously not only possible, but implementable using current technology.

  • I had Be installed for a while and I thought it would do that. I do know I never lost anything due to it crashing. Of course, it didn't crash much. I think using a journaled file system or at least soft-updates would be a good start. Frankly, I have no idea how to code something similar to Win XP hibernate. Shouldn't be that hard though.
  • by crow ( 16139 ) on Friday January 25, 2002 @02:04PM (#2902130) Homepage Journal
    What you want is known as "checkpointing."

    There have been a number of projects that do this under Unix over the years. Many of them do it for the purpose of process migration. Others do it just for recovery.

    One such project that I used in the early 90s was Condor.

    The typical approach is to do something along the lines of forcing a core dump and then doing some magic to restart the process from the core file.
  • by GeorgieBoy ( 6120 ) on Friday January 25, 2002 @02:04PM (#2902137) Homepage
    VMware suspends to disk. You can go as far as suspending the Virtual Machine, not Virtual Memory. Then copy the "data" files to another machine and resume the same suspended virtual machine like nothing ever happened, as long as the same basic hardware exists on the host system (e.g. NIC, sound, serial ports, etc).

    While this isn't quite what you are looking for, it spawns an idea of the level this can be taken to. Think of how neat it is for distributed applications. Of course, something like this has to exist somewhere...
  • Extended core dump? (Score:5, Interesting)

    by The G ( 7787 ) on Friday January 25, 2002 @02:05PM (#2902140)
    Almost all of the stuff you need is already in a core dump. Perhaps the appropriate approach to this is to try to extend the core-dumping mechanism to also dump other pieces of state. Then you would just need a way to reconstruct process state from a core dump, which most runtime debuggers can almost do anyway.

    I suspect that all the pieces of a solution are written and it's just a tricky pick-choose-and-integrate problem.

    And damn but I'd love to have this ability.
    --G
    • by ianezz ( 31449 ) on Friday January 25, 2002 @04:45PM (#2903493) Homepage
      GNU Emacs basically does this to reduce initialization times.

      When compiling Emacs from the sources, the initial executable file is only a (relatively) small virtual machine executing elisp bytecode.

      Then, it is started, and several basic elisp packages are loaded and initialized.

      Once initialized, it makes a dump of itself on a file on disk (IIRC actually dumping core by sending a fatal signal to itself).

      The dump is prepended with an appropriate loader which restores the Emacs process (in its initialized state) in memory, and the resulting file is used as the main Emacs binary (what you usually find in /usr/bin).

      This works for Emacs because it knows when it is checkpointed, and special care is taken not to do anything that depends on parts of the running environment that can't be fully restored.

  • hhgttg (Score:3, Funny)

    by Score0, Overrated ( 550447 ) on Friday January 25, 2002 @02:05PM (#2902141) Homepage
    The job it was doing would have been done in a few days,

    In that case, Arthur Dent should know the answer.
  • eros-os (Score:2, Interesting)

    by ischarlie ( 159465 )
    Back in the day there was a post:

    http://slashdot.org/article.pl?sid=99/10/28/0151212&mode=thread [slashdot.org]

    about an operating system with "journaled" processes of a sort, that would automatically back up images of its processes.
  • you can (Score:5, Informative)

    by Lumpy ( 12016 ) on Friday January 25, 2002 @02:05PM (#2902152) Homepage
    It's called Software Suspend for Linux. Look for it on freshmeat.net.
    • Re:you can (Score:5, Informative)

      by Lumpy ( 12016 ) on Friday January 25, 2002 @02:09PM (#2902209) Homepage
      AHA! I knew I still had it
      http://falcon.sch.bme.hu/~seasons/linux/swsusp.html [sch.bme.hu]

      this is what you need.
      • Re:you can (Score:2, Insightful)

        by Anonymous Coward
        Talk about the ultimate in karma whoring. Instead of just having one post modded to +5, you get two by delaying the posting of your link. It's almost criminal.
    • Re:you can (Score:3, Informative)

      There's just one tiny little problem with that. It only supports ext2. Try it with a journalling filesystem, and ... bye bye Linux partition!
      At least, last time I checked that's how it was. There may have been improvements made. It would require somewhat major changes to the VM and each filesystem in the current Linux implementation to get it working with journalled systems, or if Linux finally gets a journal-capable VM (similar to IRIX's, perhaps), it would just require some VM changes if it's done right.

      (Begin semi-OT stuff)
      Oh, and please, please everyone ask Linus not to rip out memory zones just because it's a BSD-like idea.

      Kernel 2.6 will probably be able to support hibernation without funkiness in the filesystems themselves, just a good VM setup. The new framebuffer system (Ruby) will rock, too (think 'echo "640x480-16@60" > /dev/gfx/fb/0/mode'), especially because DRI is going to be separated from X so console applications can take advantage of OpenGL as well.
  • There has been a lot of work done on "process migration" - that is, moving processes from machine to machine.
    Obviously those techniques would apply to what you are asking about.
    google has lots of links about it [google.com]
  • by spacefem ( 443435 ) on Friday January 25, 2002 @02:06PM (#2902168) Homepage
    I once had an enormous computer working out a very important question but it was destroyed by Vogons five minutes before it was finished. I feel your pain.
  • through my engineering library and I found a similar situation. A massive computer system, completely one of a kind, was destroyed prior to providing the solution to the problem for which it was designed. Recalculating the solution from scratch would take far too long, but there was one possibility. One of its computational units was still intact and the answer was surmised to be embedded deep within its memory.


    I think the same solution would apply here: Find Arthur Dent.

  • The answer is 42. :D
  • I've always wondered how hard it would be to resurrect a core file. One would think that there's enough info in a complete core to reopen all the open fd's, and possibly even reinitiate network connects. Everything else is there-- program counter, stack, heap, etc. As such, one could 'kill -ABRT' the process and revive it again later. Has anyone seen this done?
  • Suspend (Score:4, Informative)

    by selectspec ( 74651 ) on Friday January 25, 2002 @02:08PM (#2902189)
    You can't just serialize and page out one process. Under every process are a slew of kernel objects and kernel crud including the virtual to physical mappings of your address space. It would be quite a challenge to isolate all of this and somehow persist it.

    To make suspend work, you'd have to dump your entire memory image to disk. Then you swap in the entire image, kernel and user pages alike.
    • Which is exactly how Windows does it. This even seems to work with memory-intensive games that manage their own swap, like Diablo 2.
  • 1) Produce the core dump of a process.
    2) Use the core and process image to restart it (for example in a debugger such as gdb, if you don't want to write specialized software).

    To the best of my knowledge the perl "compiler" uses precisely this technique to produce perl "executables" - it dumps them out as a core right after compilation and reuses it later on.

    You can do this to a kernel as well, if you REALLY want to.

    However, since indeed many things may be dependent on the state of the kernel, files, network connections, devices, etc., doing this is not advisable.

    Good coding practice for long-running processes is to actually spend some time writing the state-saving functionality to support process restart.

    Anyway (call it a flame if ya will), the fact that /. posts this as a relevant question is very disquieting - the level of technical knowledge here gets reduced day after day.
  • by morcheeba ( 260908 ) on Friday January 25, 2002 @02:08PM (#2902194) Journal
    I've used the Suspend/Resume [hta-bi.bfh.ch] feature on a sun box. IIRC, it mostly worked, but with a minor hitch that made me worry enough to never do it again. This suspend/resume is just like the laptop version -- save a copy of all memory to disk -- not the cryogenic per-process version you're talking about.

    The per-process sounds neat, but usable only if you've got a simple critical task you're running. For a more complicated application, multiple processes may be working together, and you'd have to suspend all of them at the same time.
    One big question I would have would be file handles... if you restore a process that thinks it owns file handle #5 and some other process is already using it, it would be awkward to get either process to use a different handle.
    • A file descriptor is a per-process entity. Yes, there's a big table of file descriptors that exists for the entire system, but file descriptor 5 for process A is not file descriptor 5 for process B. Not even if they point to the same file/pipe. A case in point is FD 0, aka stdin. Every process starts out with a stdin on FD 0.

      More important is how do you tell the kernel what file descriptor 5 pointed to? What if the file/pipe doesn't exist any more?
  • by gehrehmee ( 16338 ) on Friday January 25, 2002 @02:08PM (#2902199) Homepage
    First, let me say that what the poster is suggesting sounds a little more sophisticated than a simple re-implementation of XP's hibernate function, although functionality like that under UNIX would certainly be invaluable. It sounds like the poster wants control over individual processes, something that I consider far more interesting.
    What's said here is certainly very reasonable. But the extensions of what's being suggested are even more fantastic. Once a process is completely removed from memory, with file handles and storage and status all kept away safely, is there any reason that the process is really tied to that computer? Why wouldn't it be possible to take that 'frozen' process, transfer it to another machine with access to the same filesystem on some level (some translation of file handles would likely be necessary), and thaw it there, allowing someone to move a running process to another machine? Need to replace your web server's only CPU, but don't want downtime? Move the process to a backup machine, replace the original's hardware, and move the process back.
    I even thought I had heard that someone was working on just such a project, or at least thinking about the details of implementing it. (I'm just getting started in learning UNIX internals myself.) Anybody have more references to information on this sort of thing?
  • A different solution, which is very common for long-running processes, is to use savepoints, i.e. save the state of the process regularly to a file at suitable points of the algorithm. Once your process dies or you kill it, you can restart from that savepoint. If your state information is very large, you can stretch the save interval to reasonably long times, e.g. several hours. Typically you don't mind losing some hours of calculations due to an occasional power outage.

    Of course this solution is not as general as the "process cryogenics" you describe, but it's also easier to implement because you have more information about the problem.
    • Yes, this is similar to what I've done in applications; it's especially easy in an OO environment. Coded correctly, you can view your process as a virtual machine, one that has a fixed instruction set. Serializing all of the data and dumping it to a file will allow you to pick up where you left off. Of course this is per application, but it's relatively simple to build into your app when you write it.
  • There's no reason why you can't do it either in an app by saving state or in the OS by saving memory to disk as on a laptop.

    GEOS had the concept of state-saving in the OS circa 1990, so it's nothing new. The UI saves its state, what apps are running, what windows are open, etc. and restores it exactly as you left it when you restart. If an app has extra data to save, such as where it was in a lengthy computation, it can save it, too.

    A slightly different approach than brute-force writing out all of used memory, but both work quite well with the speed of current hard drives.

  • Checkpoint/restart (Score:3, Interesting)

    by td ( 46763 ) on Friday January 25, 2002 @02:09PM (#2902212) Homepage
    This facility is called checkpoint/restart. It was a feature of OS/360 and other operating systems in the 1960s. In some very early versions of Unix, core files were restartable. Usually it's pretty easy for programs to save enough state to be restartable on a case-by-case basis, except when it's just about impossible (like when networks reconfigure), so it's not a popular system feature these days (hard to implement in a general way, and it doesn't do a very good job in the cases that can be handled easily).

    A friend of mine (Hugh Redelmeier) ran a very long (~400 day) computation on a PDP-11 in the mid-1970s. The program ran stand-alone, and part of the test plan involved flipping the power switch on and off a few times -- very amusing to watch the program keep on running right through power failures. (Main memory on the machine in question was magnetic cores, which are non-volatile.)
    • I was peripherally involved in some early efforts to include checkpoint/restart in POSIX with respect to standardizing fault tolerance and high availability features. I was a US DoD employee at the time. The military's interest was to be able (in a semi-portable standard way) to reset to a known good previous state in the case of some arbitrary failure mode in safety critical systems, i.e. flight controls, stores (weapons) management, etc. AFAIK, the POSIX standards efforts never went very far due to many different, sometimes conflicting needs. The more business-oriented high availability people had needs for very similar OS functionality that was markedly different in character from the military's viewpoint. My involvement ended in the early to mid 90's, so my understanding of the situation may be more than a little stale.

  • VMWare (Score:2, Informative)

    by Creedo ( 548980 )
    Vmware does this for the VM's it hosts. Works great.

    Creed
    • While I love VMWare, it does consume a substantial amount of CPU/memory. The problem is that a job like the one the original poster described is usually CPU or IO bound, and VMWare just starves the process of what it needs even more.

      Granted, it is a solution, but your job that ran in 3 days just got pushed out to a week. It's just a tradeoff.

      What the poster really needs is to rewrite the program to drop intermediate data along the way. If you have hourly checkpoints you can minimize the amount of data lost. How to implement checkpoints is left as an exercise to the reader :)
  • by blair1q ( 305137 ) on Friday January 25, 2002 @02:10PM (#2902216) Journal
    Any program that you intend to run for more than a day or two should checkpoint its intermediate results to disk, even if this adds 100% to the run time.

    --Blair

    P.S. Alternatively, you could write a program to have the rebooted computer pull scrabble tiles from a bag structure and print them to the screen. You might at least get some clue as to whether it was asking the right question.
  • User Control (Score:2, Interesting)

    by Skweetis ( 46377 )
    It would be neat if this could be controlled by the user. Ideally, this would be done by a process signal. To actually cause a process to hibernate, a user would do a kill -HIB $PID or something like that. Then the kernel would save the process information to a file (somewhere under /var maybe?) until it is restored.

    This next one would complicate things a bit: the user should also be able to wake up the process the same way, i.e. kill -WAK $PID. This means that an index of hibernated processes also needs to be kept synchronized between the kernel process tables and a file on disk, to be preserved between reboots.

    Maybe I'll write another kernel patch...
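
    The -HIB and -WAK signals proposed above don't exist; the closest standard mechanism today is the SIGSTOP/SIGCONT pair, which freezes a process in memory only (it survives being swapped out, but not a reboot). A hypothetical Python sketch, with a made-up PID:

    import os
    import signal

    pid = 12345                    # hypothetical PID of the long-running job

    os.kill(pid, signal.SIGSTOP)   # "freeze": the process stops getting CPU
    # ... later ...
    os.kill(pid, signal.SIGCONT)   # "thaw": it resumes where it stopped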

  • by jstott ( 212041 )
    Look at the makefile for emacs--the emacs executable is essentially a memory dump of a partially initialized emacs process. Perl's dump and undump work the same way.

    For long-running processes, rather than shut down the process when the UPS kicks in, I've always found it easier to have the program snapshot its data tables periodically (say every half-hour) and build a "resume from disk" feature into the program. This lets you restart the program from its last check-point even in the event of uncontrolled program termination (e.g. kill -9 and the like).

    -JS
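
    A matching "resume from disk" startup path, sketched in Python (the checkpoint file name and state layout are illustrative, and match the checkpointing sketch earlier in this discussion):

    import os
    import pickle

    CHECKPOINT = "job.ckpt"

    def load_or_start():
        # Resume from the last checkpoint if one exists, else start fresh.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {"i": 0, "total": 0.0}

    state = load_or_start()
    print("resuming at iteration %d" % state["i"])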

  • The main reason this "suspend" feature works relatively well for a laptop is because the hardware is a "given". The laptop has to have a certain video card and motherboard chipset, specific type of hard drive, floppy, CD-ROM and sound device. (In fact, when laptops fail to come back up properly from a suspend, it's almost always the one "add-on" card people have in laptops, the PCMCIA network adapter, that causes the problem.)

    3Com PCMCIA cards are about the only ones I've used that allow the laptop to power them down and back up again, and resume network activity without a complete machine reboot.
    • This is why VMware suspend works the way it does. It provides a consistent virtualized hardware interface, regardless of the details of the real hardware. The original question referred to individual process saving, and VMware suspend is similar to the whole OS suspend feature in laptops. Nevertheless, if you consider VMware to be a wrapper for individual processes that you want to be able to checkpoint, it turns out to be quite a nice solution to the original problem with zero programming required, and just a little pocket money to implement.

      bb
  • The comments to the effect of "it's called hibernation, and laptops have done it for years" are missing the point. That hibernation is a BIOS-supported dump to disk. It's a feature on most laptops and works with just about any OS -- it's worked on my Linux laptop for years.

    I think the feature to be discussed is Operating System (not BIOS) level support of the hibernation of a single process. It'd be nice if I could do a:

    kill -HIBERNATE `cat /var/longoperation.pid`

    and have that program get frozen to disk. Then if I could resurrect just that process later it'd be a handy feature for the long running program that you want to postpone until after you've done whatever you needed to do in single user mode.
    • by Hrunting ( 2191 ) on Friday January 25, 2002 @02:31PM (#2902377) Homepage
      And if you have something like that, you open yourself up to a wealth of potential problems in the program. Take this simple perl script.

      #!/usr/bin/perl

      use strict;

      my $pid = $$;
      print $pid;


      If you freeze the process between those two statements, there's no guarantee that you're going to get the same pid value back when it's thawed. Programs would have to be specifically written to handle this sort of thing (there are other examples, this is just the most basic; network programs particularly would have problems).
      • There are lots of other issues. If a program has a socket, or a device open, what should happen? Should the OS reopen the socket? What if the remote end is requiring status. No point reopening a FTP session if the application thinks it's already sent the userid/password but the server doesn't. What if it's a device, eg a modem, and it is locked?
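
        A tiny Python illustration of the PID problem (the freeze/thaw here is simulated with an in-memory pickle): anything tied to the old process must be refreshed on restore rather than trusted from the saved image.

        import os
        import pickle

        state = {"pid": os.getpid(), "work_done": 0}
        frozen = pickle.dumps(state)   # "freeze"

        # ... imagine the process died and was restored here ...

        state = pickle.loads(frozen)   # "thaw"
        state["pid"] = os.getpid()     # refresh: a restored process may not
                                       # get its old PID back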
  • by bartman ( 9863 )
    There are big problems with such an approach, mainly with device usage. Basically they are all the problems that you would have with process migration, plus a few more because of temporal discontinuity.

    If you are using a scanner, or a mouse, or whatever, that device may not be there or may not be available when the process is brought back. Furthermore you may have a file descriptor opened on a local (or network shared) file which no longer exists or has changed drastically.

    There are further non-device-dependent problems with shared memory, opened-but-unlinked files, parent PID, IPC resources.

    Having said all of the above... I suppose that for the very rare case where your program is completely memory- and CPU-bound, you could retire and recover a task.

    my $0.02
  • by zaius ( 147422 ) <jeff@zaius.dyndns . o rg> on Friday January 25, 2002 @02:12PM (#2902244)
    Apple implemented this feature in early versions of OS 9, but took it out after they realized that some laptops would never "unfreeze" without the user hitting a reset switch buried deep inside the laptop.

    The idea was that when you put your computer to sleep, instead of keeping the SDRAM (or whatever the laptop had) powered to preserve the memory contents, it would write it all to a special sector on the hard drive that the firmware knew to read from when starting from sleep. This allowed sleep to be even more low-power than it already is, since a hard drive does not require power to retain data.

  • EPCKPT (Score:5, Informative)

    by cmason ( 53054 ) on Friday January 25, 2002 @02:12PM (#2902245) Homepage
    EPCKPT [rutgers.edu] is a checkpoint/restart utility built into the Linux kernel. Checkpointing is the ability to save an image of the state of a process (or group of processes) at a certain point during its lifetime.

    --

  • If you could sleep processes you could run some intensive job at a high priority when you're not logged into your workstation, and then sleep the process when you log in. This way you could run some job that takes weeks or months but not bog down a workstation that you need for doing daily work.

    Yeah, you could "nice" down the process so that it doesn't slow things down while you're logged in... but then system processes at higher priorities might slow down your number crunching when you're not logged in... It'd be best to be able to run it at high priority at night only... ya know, use those unused cycles.
  • One fairly simple alternative is to simply have the application save its own state to a "checkpoint" file periodically. This approach has been used in other applications for a long time in the form of auto-save files (e.g. emacs) and would be easily adapted to a long-running program like the one you describe.

    Just because the OS doesn't support it automagically it doesn't mean that you can't solve it for yourself with a little bit of extra work and planning.
  • Software suspend (Score:2, Informative)

    by Timbo ( 75953 )
    Linux software suspend [sch.bme.hu] may be of interest.
  • Long ago and far away (about 15 years ago) I recall that TeX was frequently built in a fashion that required running the binary on some "initialization" information. That process took some nontrivial amount of time back in those days (I'm sure now it would be an eyeblink), and the program could be made to \dump its state in some way.

    Then, when you ran TeX in everyday circumstances, the digested initialization file was read in by the application as part of the usual startup process.

    I'm probably botching the explanation of how this really worked, but I guess my point is that the "resume" function had to be coded into the specific application.

  • by doorbot.com ( 184378 ) on Friday January 25, 2002 @02:18PM (#2902294) Journal
    If you have a Windows 2000 or XP machine you can enable hibernation. However, this is not a "power management" feature... it has been separated from ACPI and/or proprietary disk partitions and will work on all computers, even servers, whether they have ACPI/APM/nothing for power management.

    Once you've enabled it, you create a hibernation file on the C: drive. Hibernation should only take place when there is minimal disk activity (e.g., don't hibernate while trying to save your Word document). The system saves the contents of RAM to the hard drive, and then shuts down. When the machine boots, a flag is set (I assume) indicating the system should resume from hibernation... so the hibernation file is read from disk and written to RAM and you're back up and running, in less time than it takes to boot. Plus it keeps your uptime from resetting back to zero.

    Some things to note:

    You will need WHQL-certified drivers, or at least properly-written drivers. I have a SB Audigy and the first drivers I used (the ones on the included CD) caused a blue screen on resume from hibernation. When an updated driver was released, it fixed this issue.

    Applications need to be properly-written as well, as there is some sort of Win32 suspend signal that is sent to apps just before the system hibernates, so the app must support this and the resume command when the system is restored.

    Hibernation works great on my laptop and on my workstation, and I especially like the fact that I don't need to create a separate partition or install special drivers to make it work (you can even use it on an NTFS formatted drive).

    • Creative releasing drivers that cause a bluescreen?

      Who would have thought it was possible.

      Rule 1 with hibernation, no creative products.
    • This is not strictly speaking a W2K function. The real kicker here for Linux folks is that the easiest way to do hibernation in the modern world is to use ACPI, which Linux doesn't do very well. (See this week's LWN [lwn.net] for a timely discussion.)

      APM BIOSes can also do this, but they aren't as standard: Often the implementation details are specific to the hardware. For instance, Phoenix BIOSes (at least as of two years ago, I haven't messed with this stuff much since then) tend to want to put the STD (suspend-to-disk) data in a special file in a Windows partition, while some others (Dell for sure, since I used to work this stuff for them) save this info in a special STD partition (type 84, IIRC) which is a more generic solution, but requires more knowledge when setting up the box. (When was the last time you thought you might need an STD partition when building your box? BTW, they should be at a minimum, PhysicalMemorySize + 1 MB for state info, video register settings, etc.)
      • This is not strictly speaking a W2K function.

        Agreed, and as you go on to explain, and I believe I alluded to in my post, there are many proprietary implementations via the BIOS or DOS drivers, etc.

        My point was that Windows 2000 separates the hibernation feature from the BIOS. As far as the BIOS can tell, the system is booting normally... but once the BIOS loads the NTLDR, Windows takes over of course and handles the hibernation. This is why it works so well and does not have all of the "stupid issues" such as custom drivers, partitions, or the like. The end result is not a MS-only function, but the implementation is, as far as I can tell.
    • Not according to Microsoft (on their knowledgebase). This article [microsoft.com] states that Win2k needs ACPI to support OS hibernation, and that the BIOS has to support it. Although Microsoft has been known to contradict itself.

      And simply having a WHQL-certified driver doesn't necessarily mean it'll work. I had a Future Domain SCSI controller in my computer that loaded with the default Win2k WHQL driver, but I could never hibernate it. When I swapped it out with an Adaptec 2940UW, I was able to enable Hibernation in my Control Panel settings.

  • by Seth Finkelstein ( 90154 ) on Friday January 25, 2002 @02:19PM (#2902302) Homepage Journal
    The idea of saving the state of a process is very well-known. Take a look at anything from emacs dumping [berkeley.edu] to the gcore(1) [princeton.edu] program. It's been used in everything from saved games of Rogue to saved states of PERL.

    But isn't it overkill for a data-crunching operation? As many other people have noted, it would seem you're much better off checkpointing your data to disk, rather than relying on low-level OS process wizardry.

    Sig: What Happened To The Censorware Project (censorware.org) [sethf.com]

  • There is a kernel patch to do this. It's called Software Suspend [sch.bme.hu]. It is also part of the FOLK [sourceforge.net] project (Functionality Overloaded Linux Kernel, a project to merge the largest possible amount of patches into the kernel).
  • Surely if this process takes so long to execute, the person who wrote it should have made it save its state every once in a while. Problems like these could have been avoided! SETI@home, to name but one, does exactly this.

    James
  • I think that this might also be a really good bug fix/hacking tool. I can also remember something like this for the Apple II in years gone by. You could press a button and take a snapshot of all memory in the system. Then you could write the executable part to disk and pick up where you left off. Good for freezing a copy of a game or whatever.

    This would also be good for tracking down bugs using the "before and after" technique.

    Such a program could be tied into the UPS monitor in such a way as to save everything that couldn't be stopped.
  • CDC Cyber 205 (Score:5, Interesting)

    by epepke ( 462220 ) on Friday January 25, 2002 @02:26PM (#2902356)

    As usual, this is ancient. Back at FSU, we had a CDC Cyber 205, a vector pipeline supercomputer, back in 1985. Any process could be crashed for a shutdown, and it produced a file that worked exactly like an executable and resumed computation from the time it was crashed.

  • by Nelson ( 1275 ) on Friday January 25, 2002 @02:28PM (#2902366)
    I've thought about this for booting issues. I have a server that's all journaled and everything, and it periodically gets bumped. Boot time is still on the order of 2 to 4 minutes for a full Linux server install. With my current stats that means I'm probably going to miss a hit or two on one of the web pages, all things being equal. A good portion of that is just icing, though - things that are there "just in case" or get used infrequently. (Okay, I can screw with the init order and the problem essentially goes away, or I can switch hardware, but we're nerds and geeks, so let's just explore this.)


    I was thinking about this, and here was my dirty hacky idea. You need kexec, LOBOS, or something similar (actually a fairly modified version of it), you'll need on the order of 8MB of disk space, and some kernel mods, which might not be that extensive.


    I was thinking we develop some driver or process that consumes all of the memory and CPU in a system. It forces all of the processes to swap out; it would probably need to be a driver of sorts on current Linux systems. Then it could dump the kcore out to a file somewhere, sync it, and hibernate. Then when the kernel boots up, if the right arg is passed in, it could either load this image back into RAM in place of the kernel and then jump into it (easier said than done) early in the boot (page tables are made long before you have access to the drives and such, so the logistics of this would need to be figured out), or it could boot up, use a different swap partition, and then have some kind of tool like kexec to load that image back into RAM and start it up. Or something - somehow you should be able to recover the state of the system. File handles and everything would be there.


    The harder part would be hardware and network transparency. You'd need to modify all of your drivers to make sure that the hardware could be reset and they could deal with it. I think it's a little easier on the network side because it would be similar to simply unplugging the network cable: you have open sockets that are talking to nothing, and some software can deal with that pretty well. There is also some kind of system integrity or robustness piece that is needed; if the system somehow changes when you bring your old image back, it could break things, munge files, etc.

  • by Pharmboy ( 216950 ) on Friday January 25, 2002 @02:28PM (#2902368) Journal
    seti@home kinda does it.

    The seti@home client uses its *.sah files to save the state of a calculation. Of course, this is program dependent, not OS dependent. I guess if you have the source files for the program doing the counting.....

  • by Anonymous Coward on Friday January 25, 2002 @02:29PM (#2902372)
    STANDALONE CONDOR CHECKPOINTING:

    Using the Condor checkpoint library without the remote system call functionality and outside of the Condor system is known as
    "standalone" mode checkpointing.

    To link in standalone mode, follow the instructions for linking Condor executables, but replace condor_syscall_lib.a with libckpt.a. If you
    have installed Condor version 5.62 or above, you can easily link your program for standalone checkpointing using the condor_compile
    utility with the little-known "-condor_standalone" option. For example:

    condor_compile <compiler> -condor_standalone [options/files....]

    where <compiler> is any of cc, f77, gcc, g++, ld, etc. Just enter "condor_compile" by itself to see a usage summary, and/or refer to
    the condor_compile man page for additional information.

    Once your program is relinked with the Condor standalone-checkpointing library (libckpt.a), your program will sport two new command
    line arguments: "-_condor_ckpt <filename>" and "-_condor_restart <filename>".

    If the command line looks like:

    exec_name -_condor_ckpt <filename> ...

    then we set up to checkpoint to the given file name.

    If the command line looks like:

    exec_name -_condor_restart <filename> ...

    then we effect a restart from the given file name.

    Any Condor command line options are removed from the head of the command line before main() is called. If we aren't given
    instructions on the command line, by default we assume we are an original invocation, and that we should write any checkpoints to the
    name by which we were invoked with a "ckpt" extension.

    To cause a program to checkpoint and exit, send it a SIGTSTP signal. For example, in C you would add the following line to your code:

    kill( getpid(), SIGTSTP );

    Note that most Unix shells are configured to send a TSTP signal to the foreground process when the user enters a Ctrl-Z. To cause a program to write a periodic checkpoint (i.e., checkpoint and continue running), send it a SIGUSR2:

    kill( getpid(), SIGUSR2 );

    In addition to the command-line parameters interface described above, a C interface is also provided for restarting a program from a
    checkpoint file. The prototypes are:

    void init_image_with_file_name( char *ckpt_name );

    void init_image_with_file_descriptor( int fd );

    void restart( );

    The init_image_with_file_name() and init_image_with_file_descriptor() functions are used to specify the location of the checkpoint file.
    Only one of the two must be used. The restart() function causes the process image from the specified file to be read and restored.
  • by Alan ( 347 ) <arcterexNO@SPAMufies.org> on Friday January 25, 2002 @02:31PM (#2902374) Homepage
    I think it was somewhere in the list of patches from the -mjc tree (see here [slashdot.org]) that there was a patch for the entire kernel for Linux. Basically it lets the system save its state, and then restore it if it detects that it was shut down at that point. I'm not sure if this is what you want (and I couldn't get it working), but it's certainly a step in the right direction to what you're looking for.

    Just found it here [kernel.org], it's the 'swsusp' patch.
  • If you use the java.io serialization stuff right, you can create lightweight persistence, and you should be able to freeze and resume processes within the same application if you handle threading right.
  • by Anonymous Coward
    The answer would have been 42 once the processing was complete. So who cares? Get a bigger UPS :-)
  • I think this problem is more easily solved in hardware than in software. With recent advances in solid-state memory, hopefully a standard can be worked out so that solid-state memory can replace or complement volatile memory (i.e., RAM as we know it). Solid-state memory would survive a power outage, and you could pick up where you left off.

    The disadvantages are speed (solid-state memory is getting faster all the time, but it is still slower than volatile RAM), cost, and lack of current standardized implementations (I'm not even sure there are any working implementations.)

    For some background research in solid-state memory, check out this site [nta.org] (it's a bit old, but still interesting).
  • by Mysticalfruit ( 533341 ) on Friday January 25, 2002 @02:45PM (#2902469) Homepage Journal
    What if the process has forked off a bunch of children? Are you going to archive all the children at the same time? What if the process has a whole bunch of files in /tmp - are you going to roll them up into the freeze state as well? What if you're using pthreads? Are you going to keep the state for each thread? How about file pointers?

    I think the better solution is to create a new signal called "SIGFREEZE" and have programs include code that can handle such an event. Let each program figure out how to save its own stuff.

    A good example would be a program that was calculating pi. The programmer would have to implement a signal handler that, when it received a SIGFREEZE, would stop its computation and write whatever it is currently working on out to a file. The programmer should also be writing the data out to a file periodically anyway. Then the programmer should implement a command-line option that would facilitate reloading from a saved state.

    That's my take on it...

    If you see any problems with it... bring it on.
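
    A rough Python sketch of that idea, using SIGUSR1 as a stand-in for the proposed SIGFREEZE (the state layout and file name are made up). The handler only sets a flag, and the main loop does the actual write at a safe point:

    import pickle
    import signal

    freeze_requested = False

    def on_freeze(signum, frame):
        # Do as little as possible inside the handler; just set a flag.
        global freeze_requested
        freeze_requested = True

    signal.signal(signal.SIGUSR1, on_freeze)   # stand-in for "SIGFREEZE"

    digits = []
    i = 0
    while True:
        i += 1                                 # stand-in for computing pi
        if freeze_requested:
            with open("pi.ckpt", "wb") as f:
                pickle.dump((digits, i), f)
            raise SystemExit("state frozen to pi.ckpt")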
  • If memory serves me (hey, it is Friday after all and both brain cells are pretty tired) we looked into something like what the poster was asking about years ago. In those days, we were running some simulations on a PDP-11/70 that took 7-10 days to complete. In the event of a general power failure we wouldn't have been able to run on backup power for very long. DEC's RSX had a feature whereby a task could be checkpointed to disk. Then, presumably, it could be reloaded and resumed at the same state it was in at the time of the checkpoint. We never did implement it since it would have introduced too much delay into the project schedule (adding it to the simulation, testing, etc.) but it sounds like the sort of thing that could be useful in current day OSs. Anyone know of any general purpose operating systems today that have this feature? I haven't heard of any and wonder (not too seriously, mind you) if anyone sells core memory for a PC architecture computer. Of course, it wouldn't be very fast but you'd worry a lot less about power failures that are longer than the UPS's ability to provide power.

  • by Anonymous Coward on Friday January 25, 2002 @03:57PM (#2903114)
    Sun already implements a system suspend/unsuspend in Solaris that works on all boxes but the Blade 100s.

    10 years ago I worked on a Unisys Unix box that did it automatically, meaning you could pull the power out of the wall without any warning and then plug it back in later. When the system rebooted, it would say "there's been a power failure, recovering" and then put all the processes back the way they were before. Even with an open vi session where I was actively typing, I wouldn't lose more than a character or two.

    I found out the machine had it quite by accident because my loser boss turned the box off one evening without doing a proper shutdown... Once I saw what it did, this required further testing :-)

    Still, what would be even better is if it could be done on a per-process basis. I can think of many reasons why you might want to suspend a process for a few days and bring it back later (say, something you only wanted to run outside of work hours), but had no intention of shutting the whole box down. And this should be implemented in the kernel, not by hacking each program to provide this functionality.
  • A case for Python (Score:3, Informative)

    by defile ( 1059 ) on Friday January 25, 2002 @07:42PM (#2904405) Homepage Journal

    Python [python.org] supports a concept that it calls 'pickling' (which is also known as Object Serialization).

    It's extremely easy to save the state of any object along with the objects it references to disk with literally a couple of lines of code (like, 3). You cannot pickle whole processes, but it's effortless to write some skeleton code to resume the process from its last pickle. You can also define specific methods in each object that are called on pickle/unpickle for special cases (restoring network connections, for example).

    The fact that it's an interpreted language shouldn't deter you. Python integrates easily with modules compiled from C, allowing you to accelerate time-critical aspects of your code while rapidly developing the not-so-critical aspects.** Python was designed to solve the problems you're working on.

    Oh, and if you're short on time, don't worry; Python is extremely easy to learn.

    ** As most programmers have found, about 90% of their program's execution is spent in 5% of their code.
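
    For instance, the per-object pickle/unpickle hooks mentioned above are the __getstate__ and __setstate__ methods. A sketch (the class and host are invented) in which a network connection is dropped on freeze and rebuilt on thaw:

    import pickle
    import socket

    class Fetcher:
        def __init__(self, host):
            self.host = host
            self.progress = 0
            self.sock = socket.create_connection((host, 80))

        def __getstate__(self):
            # Called on pickle: drop what can't be serialized.
            state = self.__dict__.copy()
            del state["sock"]
            return state

        def __setstate__(self, state):
            # Called on unpickle: rebuild the connection.
            self.__dict__.update(state)
            self.sock = socket.create_connection((self.host, 80))

    f = Fetcher("example.com")
    frozen = pickle.dumps(f)       # freeze
    thawed = pickle.loads(frozen)  # thaw: reconnects automatically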
