
Linux Needs Resource Management For Complex Workloads

storagedude writes: Resource management and allocation for complex workloads has been a need for some time in open systems, but no one has ever followed through on making open systems look and behave like an IBM mainframe, writes Henry Newman at Enterprise Storage Forum. Throwing more hardware at the problem is a costly solution that won't work forever, he notes.

Newman writes: "With next-generation technology like non-volatile memories and PCIe SSDs, there are going to be more resources in addition to the CPU that need to be scheduled to make sure everything fits in memory and does not overflow. I think the time has come for Linux – and likely other operating systems – to develop a more robust framework that can address the needs of future hardware and meet the requirements for scheduling resources. This framework is not going to be easy to develop, but it is needed by everything from databases and MapReduce to simple web queries."

  • by Lisias ( 447563 ) on Sunday July 20, 2014 @02:40AM (#47492649)

    > I know you're afraid of the garbage collector, but it won't bite. I promise.

    Yes, it will. It's not common, but it happens - and when it happens, it's nasty. Pretty nasty.

    But not as nasty as micromanaging the memory myself, so I keep licking my wounds and moving on with it.

    (but sometimes it would be nice to have finer control over it)

  • by K. S. Kyosuke ( 729550 ) on Sunday July 20, 2014 @04:24AM (#47492889)

    > Garbage collection necessarily wastes memory by a factor of 1.5 to 2.

    And manual memory management on a similar scale wastes CPU time. And don't the techniques that alleviate one tend to help the other as well?

    > Finally, the most important aspect for program performance is locality and memory layout, something you cannot even optimize for in a language where every object is a pointer to some memory on a garbage-collected heap.

    There's not a dichotomy here. Oberon and Go are garbage collected without everything being a heap pointer.

  • by lkcl ( 517947 ) <lkcl@lkcl.net> on Sunday July 20, 2014 @04:40AM (#47492919)

    i am running into exactly this problem on my current contract. here is the scenario:

    * UDP traffic (an external requirement that cannot be influenced) comes in
    * the UDP traffic contains multiple data packets (call them "jobs") each of which requires minimal decoding and processing
    * each "job" must be farmed out to *multiple* scripts (for example, 15 is not unreasonable)
    * the responses from each job running on each script must be collated then post-processed.

    so there is a huge fan-out where jobs (approximately 60 bytes each) are coming in at a rate of 1,000 to 2,000 per second; those are being multiplied up by a factor of 15 (to 15,000 to 30,000 per second, each taking very little time in and of itself), and the responses - all 15 to 30 thousand - must be in order before being post-processed.
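
    purely for illustration (hypothetical names, nothing to do with the actual contract code), the in-order collation constraint amounts to something like this: responses may complete out of order, but must be released to post-processing strictly in sequence:

      # hypothetical reorder buffer: hold completed responses until the
      # next expected sequence number arrives, then release them in order
      import heapq

      class ReorderBuffer:
          def __init__(self):
              self._heap = []        # min-heap of (seq, response)
              self._next_seq = 0     # next sequence number to release

          def add(self, seq, response):
              heapq.heappush(self._heap, (seq, response))
              ready = []
              while self._heap and self._heap[0][0] == self._next_seq:
                  ready.append(heapq.heappop(self._heap)[1])
                  self._next_seq += 1
              return ready

      buf = ReorderBuffer()
      print(buf.add(1, "b"))   # [] - job 0 has not completed yet
      print(buf.add(0, "a"))   # ['a', 'b'] - released in order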

    so, the first implementation is in a single process, and we just about achieve the target of 1,000 jobs per second, but only with about 10 scripts per job.

    anything _above_ that rate and the UDP buffers overflow, with no way to know whether data has been dropped. the data is *not* repeated, and there is no back-communication channel.

    the second implementation uses a parallel dispatcher. i went through half a dozen different implementations.

    the first ones used threads, semaphores through python's multiprocessing.Pipe implementation. the performance was beyond dreadful - it was deeply alarming. after a few seconds, performance would drop to zero. strace investigations showed that under heavy load the futex syscall was maxed out at close to 100%.
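
    for reference, a minimal sketch (hypothetical names, not the actual code) of that style of pipe-based fan-out; under CPython the GIL plus any semaphores guarding shared pipe ends are the usual suspects when futex dominates the strace output:

      # sketch: worker threads each fed through their own multiprocessing.Pipe
      import threading
      from multiprocessing import Pipe

      def worker(conn):
          while True:
              job = conn.recv()           # block until the dispatcher sends a job
              if job is None:             # sentinel: shut down
                  break
              conn.send(("done", job))    # trivial "result" back to the dispatcher

      dispatcher_ends, threads = [], []
      for _ in range(4):
          dispatcher_end, worker_end = Pipe()
          t = threading.Thread(target=worker, args=(worker_end,))
          t.start()
          dispatcher_ends.append(dispatcher_end)
          threads.append(t)

      for i in range(100):                # round-robin dispatch
          dispatcher_ends[i % 4].send(i)
      results = [dispatcher_ends[i % 4].recv() for i in range(100)]

      for conn in dispatcher_ends:        # tell workers to exit
          conn.send(None)
      for t in threads:
          t.join()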

    next came replacement of multiprocessing.Pipe with unix socket pairs, and threads with processes, so as to regain proper control over signals, sending of data and so on. early variants of that would run absolutely fine up to some arbitrary limit, then performance would plummet to around 1% or less, sometimes remaining there and sometimes recovering.

    next came replacement of select with epoll, and the addition of edge-triggered events. after considerable bug-fixing a reliable implementation was created. testing began, and the CPU load slowly cranked up towards the maximum possible across all 4 cores.
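
    for reference, a minimal sketch (hypothetical, linux-only, nothing to do with the actual code) of edge-triggered epoll over a non-blocking socket pair; with EPOLLET the descriptor has to be drained until EAGAIN or events get silently missed:

      import errno
      import select
      import socket

      # one end for the dispatcher, one standing in for a worker
      dispatcher_sock, worker_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
      dispatcher_sock.setblocking(False)

      ep = select.epoll()
      ep.register(dispatcher_sock.fileno(), select.EPOLLIN | select.EPOLLET)

      worker_sock.send(b"result-1")       # simulate a worker posting two results
      worker_sock.send(b"result-2")

      for fd, events in ep.poll(timeout=1.0):
          while True:                     # edge-triggered: drain until EAGAIN
              try:
                  data = dispatcher_sock.recv(4096)
              except OSError as e:
                  if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                      break
                  raise
              print("got", data)

      ep.unregister(dispatcher_sock.fileno())
      ep.close()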

    the performance metrics came out *WORSE* than the single-process variant. investigations began and showed a number of things:

    1) even though each job is only 60 bytes, the pre-processing required to decide which process to send it to was so great that the dispatcher process was becoming severely overloaded

    2) each process was spending approximately 5 to 10% of its time doing actual work and NINETY PERCENT of its time waiting in epoll for incoming work.

    this is unlike any other "normal" client-server architecture i've ever seen before. it is much more like the mainframe "job processing" that the article describes, and the linux OS simply cannot cope.

    i would have used POSIX shared memory Queues but the implementation sucks: it is not possible to identify the shared memory blocks after they have been created so that they may be deleted. i checked the linux kernel source: there is no "directory listing" function supplied and i have no idea how you would even mount the IPC subsystem in order to list what's been created, anyway.

    i gave serious consideration to using the python LMDB bindings because they provide an easy API on top of memory-mapped shared memory with copy-on-write semantics. early attempts at that gave dreadful performance: i have not investigated fully why that is: it _should_ work extremely well because of the copy-on-write semantics.
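
    for what it's worth, a minimal sketch of the LMDB idea (assumes the py-lmdb package and hypothetical key names); read transactions see a stable copy-on-write snapshot while the writer keeps adding jobs:

      import lmdb

      env = lmdb.open("/tmp/jobqueue", map_size=64 * 1024 * 1024)

      # dispatcher side: store a job under its sequence number
      with env.begin(write=True) as txn:
          txn.put(b"job:00000001", b"roughly 60 bytes of packet payload ...")

      # worker side: read transactions get a consistent MVCC snapshot
      with env.begin() as txn:
          print(txn.get(b"job:00000001"))

      env.close()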

    we also gave serious consideration to just taking a file, memory-mapping it and then appending job data to it, then using the mmap'd file for spin-locking to indicate when the job is being processed.
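
    again purely as a sketch (hypothetical layout, single producer and single consumer, a polled flag byte rather than a real atomic lock), the mmap-and-flag idea looks something like:

      import mmap
      import os
      import struct
      import time

      PATH = "/tmp/jobs.mmap"
      SIZE = 4096

      with open(PATH, "wb") as f:              # create and size the backing file
          f.write(b"\x00" * SIZE)

      fd = os.open(PATH, os.O_RDWR)
      buf = mmap.mmap(fd, SIZE)

      # producer: write length + payload after the flag byte, then set the flag
      payload = b"job payload (~60 bytes)"
      buf[1:3] = struct.pack("<H", len(payload))
      buf[3:3 + len(payload)] = payload
      buf[0:1] = b"\x01"                       # flag: job ready

      # consumer: spin-wait on the flag, then read the job back out
      while buf[0] != 1:
          time.sleep(0)                        # yield rather than burn the cpu
      (length,) = struct.unpack("<H", buf[1:3])
      print(buf[3:3 + length])

      buf[0:1] = b"\x00"                       # flag: slot free again
      buf.close()
      os.close(fd)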

    after all of these crazy implementations, i basically have absolutely no confidence in the linux kernel, nor in the GNU/Linux POSIX-compliant implementation on top of it - no confidence that it can handle this kind of load.

    so i would be very interested to hear from anyone who has had to design similar architectures, and how they dealt with it.

  • by Tough Love ( 215404 ) on Sunday July 20, 2014 @05:12AM (#47492975)

    Garbage collector with no overhead, hmm? Easy peasy with no satanic complexity I suppose. And of course no obnoxious corner cases. Equivalently in engineering, when your bridge won't stay up you just add a sky hook. Easy.

  • by Mr Thinly Sliced ( 73041 ) on Sunday July 20, 2014 @06:44AM (#47493195)

    > the first ones used threads, semaphores through python's multiprocessing.Pipe implementation.

    I stopped reading when I came across this.

    Honestly - why are people trying to do things that need guarantees with python?

    The fact that you have strict timing requirements means you should be using a realtime kernel and realtime threads, with a dedicated network card and dedicated processes handling the IRQs for that card.

    Taking the incoming messages from UDP and posting them onto a message bus should be step one, so that you don't lose them.
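
    As a rough illustration (hypothetical names, with an in-process queue standing in for a real message bus): one thread does nothing except drain the UDP socket and enqueue raw datagrams, so the kernel receive buffer is emptied as fast as possible.

      import queue
      import socket
      import threading

      inbound = queue.Queue()

      def udp_drain(host="0.0.0.0", port=9999):
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024 * 1024)
          sock.bind((host, port))
          while True:
              data, addr = sock.recvfrom(2048)   # ~60-byte jobs fit easily
              inbound.put(data)                  # hand off; do nothing else here

      threading.Thread(target=udp_drain, daemon=True).start()

      # consumers pull from `inbound` and do the decode/fan-out at their own pace
      # job = inbound.get()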
