Linux Needs Resource Management For Complex Workloads

storagedude writes: Resource management and allocation for complex workloads have been a need for some time in open systems, but no one has ever followed through on making open systems look and behave like an IBM mainframe, writes Henry Newman at Enterprise Storage Forum. Throwing more hardware at the problem is a costly solution that won't work forever, he notes.

Newman writes: "With next-generation technology like non-volatile memories and PCIe SSDs, there are going to be more resources in addition to the CPU that need to be scheduled to make sure everything fits in memory and does not overflow. I think the time has come for Linux – and likely other operating systems – to develop a more robust framework that can address the needs of future hardware and meet the requirements for scheduling resources. This framework is not going to be easy to develop, but it is needed by everything from databases and MapReduce to simple web queries."
  • by Animats ( 122034 ) on Sunday July 20, 2014 @02:53AM (#47492699) Homepage

    That level of control probably belongs at the cluster management level. We need to do less in the OS, not more. For big data centers, images are loaded into virtual machines, network switches are configured to create a software defined network, connections are made between storage servers and compute nodes, and then the job runs. None of this is managed at the single-machine OS level.

    With some VM system like Xen managing the hardware on each machine, the client OS can be minimal. It doesn't need drivers, users, accounts, file systems, etc. If you're running in an Amazon AWS instance, at least 90% of Linux is just dead weight. Job management runs on some other machine that's managing the server farm.

  • Linux Cgroups (Score:4, Informative)

    by corychristison ( 951993 ) on Sunday July 20, 2014 @03:00AM (#47492711)

    Is this not what Linux Cgroups is for?

    From Wikipedia (http://en.m.wikipedia.org/wiki/Cgroups):
    cgroups (abbreviated from control groups) is a Linux kernel feature to limit, account for, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups.

    From what I understand, LXC is built on top of Cgroups.

    I understand the article is talking about "mainframe"- or "cloud"-like build-outs, but for the most part, what he is talking about is already coming together with Cgroups.
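
    A minimal sketch of that kernel interface, using the cgroup v2 file layout (the group name, the limits, and root privileges are all assumptions; cgroup v1 paths differ slightly):

        # Cap a process group's memory and CPU via cgroup v2.
        # The group name and limits are invented for illustration.
        import os

        CG = "/sys/fs/cgroup/demo"                   # hypothetical group
        os.makedirs(CG, exist_ok=True)

        with open(os.path.join(CG, "memory.max"), "w") as f:
            f.write("536870912")                     # 512 MiB hard cap
        with open(os.path.join(CG, "cpu.max"), "w") as f:
            f.write("200000 100000")                 # 2 CPUs per 100 ms period
        with open(os.path.join(CG, "cgroup.procs"), "w") as f:
            f.write(str(os.getpid()))                # move this process in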

  • Re:Linux Cgroups (Score:2, Informative)

    by Anonymous Coward on Sunday July 20, 2014 @03:46AM (#47492797)

    The article is not about "mainframe" or "cloud"... it is "advertising" for IBM... a company in the middle of multi-billion-dollar deals with Apple, all the while fighting to remain even slightly relevant.

    IBM has the magic solution to finally allow the world to run simple web queries.

    FUCK OFF

  • by Anonymous Coward on Sunday July 20, 2014 @05:35AM (#47493015)

    Yeah - the sky is the limit!!!
    Use your Microsoft cloud capabilities without hesitation....

    This message was brought to you by your friendly NSA...

  • by viperidaenz ( 2515578 ) on Sunday July 20, 2014 @06:02AM (#47493075)

    On the contrary, if you can increase the performance of each node by 2x with 100,000 nodes, you've just saved 50,000 of them.

    That's a pretty big cost saving.

    The larger the installation, the more important resource management is. If you need to add more nodes, not only do you need to buy them, increase network capacity, and power them; you also need more cooling capacity and floor space. Your failure rate goes up too, and the higher the failure rate, the more staff you need to replace things.
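
    The back-of-the-envelope arithmetic, with every price invented for illustration:

        # Savings from a 2x per-node performance gain at constant throughput.
        nodes = 100_000
        speedup = 2.0
        needed = int(nodes / speedup)          # nodes still required
        saved = nodes - needed                 # 50,000 nodes retired

        cost_per_node = 5_000                  # dollars, assumed
        power_cooling_per_year = 1_200         # dollars per node, assumed
        print(saved * cost_per_node)           # capital not spent
        print(saved * power_cooling_per_year)  # recurring savings per year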

  • by lkcl ( 517947 ) <lkcl@lkcl.net> on Sunday July 20, 2014 @07:51AM (#47493359) Homepage

    > the first ones used threads, semaphores through python's multiprocessing.Pipe implementation.

    > I stopped reading when I came across this.

    > Honestly - why are people trying to do things that need guarantees with python?

    because we have an extremely limited amount of time as an additional requirement, and we can always rewrite critical portions (or later the entire application) in c once we have delivered a working system, which means the client can get some money in and can therefore stay in business.

    also i worked with david and we benchmarked python-lmdb after adding in support for looped sequential "append" mode and got a staggering performance metric of 900,000 100-byte key/value pairs written per second, and a sequential read performance of 2.5 MILLION records per second. the equivalent c benchmark is only around double those numbers. we don't *need* the dramatic performance increase that c would bring if, right now, at this exact phase of the project, we are targeting something that is 1/10th to 1/5th the performance of c.

    so if we want to provide the client with a product *at all*, we go with python.
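
    A sketch of the append-mode benchmark loop described above, using the py-lmdb binding (pip install lmdb); the path, record count, and key layout are assumptions:

        import time
        import lmdb

        env = lmdb.open("/tmp/bench.lmdb", map_size=2 ** 30)
        n = 1_000_000
        value = b"x" * 100

        t0 = time.time()
        with env.begin(write=True) as txn:
            for i in range(n):
                # append=True requires keys in sorted order and skips the
                # usual page search, which is where the speedup comes from
                txn.put(i.to_bytes(8, "big"), value, append=True)
        writes_per_sec = n / (time.time() - t0)

        t0 = time.time()
        with env.begin() as txn:
            for _ in txn.cursor():             # sequential read of all records
                pass
        reads_per_sec = n / (time.time() - t0)
        print(f"{writes_per_sec:,.0f} writes/s, {reads_per_sec:,.0f} reads/s")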

    but one thing that i haven't pointed out is that i am an experienced linux python and c programmer, having been the lead developer of samba tng back from 1997 to 2000. i simply transferred all of the tricks that i know involving while-loops around non-blocking sockets and so on over to python... and none of them helped. if you get 0.5% of the required performance in python, it's so far off the mark that you know something is drastically wrong. converting the exact same program to c is not going to help.

    > The fact you have strict timing guarantees means you should be using a realtime kernel and realtime threads with a dedicated network card and dedicated processes on IRQs for that card.

    we don't have anything like that [strict timing guarantees] - not for the data itself. the data comes in on a 15 second delay (from the external source that we do not have control over) so a few extra seconds delay is not going to hurt.

    so although we need the real-time response to handle the incoming data, we _don't_ need the real-time capability beyond that point.

    > Take the incoming messages from UDP and post them on a message bus should be step one so that you don't lose them.

    .... you know, i think this is extremely sensible advice (which i have heard from other sources) so it is good to have that confirmed... my concerns are as follows:

    questions:

    * how do you then ensure that the process receiving the incoming UDP messages is high enough priority to make sure that the packets are definitely, definitely received?

    * what support from the linux kernel is there to ensure that this happens?

    * is there a system call which makes sure that data received on a UDP socket *guarantees* that the process receiving it is woken up as an absolute priority over and above all else?

    * the message queue destination has to have locking otherwise it will be corrupted. what happens if the message queue that you wish to send the UDP packet to is locked by a *lower* priority process?

    * what support in the linux kernel is there to get the lower priority process to have its priority temporarily increased until it lets go of the message queue on which the higher-priority task is critically dependent?

    this is exactly the kind of thing that is entirely missing from the linux kernel. temporary automatic re-prioritisation was something that was added to solaris by sun microsystems quite some time ago.

    to the best of my knowledge the linux kernel has absolutely no support for these kinds of very important re-prioritisation requirements.

  • by Mr Thinly Sliced ( 73041 ) on Sunday July 20, 2014 @08:25AM (#47493435) Journal

    First - the problem with python is that, because it runs in a VM, you've got a whole lot of baggage in that process that's out of your control (mutexes, mallocs, stalls for housekeeping).

    Basically you've got a strict timing guarantee dictated by the fact that you have incoming UDP packets you can't afford to drop.

    As such, you need a process sat on that incoming socket that doesn't block and can't be interrupted.

    The way you do that is to use a realtime kernel and dedicate a CPU using process affinity to a realtime receiver thread. Make sure that the only IRQ interrupt mapped to that CPU is the dedicated network card. (Note: I say realtime receiver thread, but in fact it's just a high priority callback down stack from the IRQ interrupt).

    This realtime receiver thread should be a "complete" realtime thread - no malloc, no mutexes. Passing messages out of these realtime threads should be done via non-blocking ring buffers to high (regular) priority threads who are in charge of posting to something like zeromq.

    Depending on your deadlines, you can make it fully non-blocking but you'll need to dedicate a CPU to spin lock checking that ring buffer for new messages. Second option is that you calculate your upper bound on ring buffer fill and poll it every now and then. You can use semaphores to signal between the threads but you'll need to make that other thread realtime too to avoid a possible priority inversion situation.
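
    A sketch of that receiver setup using the scheduling knobs CPython happens to expose (the CPU number, port, and priority are invented; os.sched_setscheduler needs root or CAP_SYS_NICE, and the point above stands that the hot loop itself really belongs in C):

        import os
        import socket

        os.sched_setaffinity(0, {3})           # pin to CPU 3 (assumed isolated)
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(80))

        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024 * 1024)
        sock.bind(("0.0.0.0", 9999))           # hypothetical feed port

        while True:
            data, addr = sock.recvfrom(65536)
            # hand off to the ring buffer here; do no other work on this CPU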

    > how do you then ensure that the process receiving the incoming UDP messages is high enough priority to make sure that the packets are definitely, definitely received

    As mentioned: dedicate a CPU, mask everything else off from it, and make the IRQ point to it.

    > what support from the linux kernel is there to ensure that this happens

    With a realtime thread the only other thing that could interrupt it would be another realtime priority thread - but you should make sure that situation doesn't occur.

    > is there a system call which makes sure that data received on a UDP socket *guarantees* that the process receiving it is woken up as an absolute priority over and above all else

    Yes, IRQ mapping to the dedicated CPU with a realtime receiver thread.
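
    For illustration, steering the card's IRQ to that CPU by hand (the IRQ number 42 is hypothetical; the real one is listed in /proc/interrupts, and this needs root):

        IRQ, CPU = 42, 3
        with open(f"/proc/irq/{IRQ}/smp_affinity_list", "w") as f:
            f.write(str(CPU))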

    > the message queue destination has to have locking otherwise it will be corrupted. what happens if the message queue that you wish to send the UDP packet to is locked by a *lower* priority process

    You might get away with having the realtime receiver thread do the zeromq message push (for example) but the "real" way to do this would be lock-free ring buffers and another thread being the consumer of that.
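
    A sketch of that single-producer/single-consumer structure (shown in Python for readability; a genuinely lock-free version of this would be written in C, since here the GIL is doing the memory-ordering work):

        class SpscRing:
            def __init__(self, size=4096):
                self.buf = [None] * size
                self.size = size
                self.head = 0                  # advanced only by the consumer
                self.tail = 0                  # advanced only by the producer

            def push(self, item):              # producer side, never blocks
                nxt = (self.tail + 1) % self.size
                if nxt == self.head:
                    return False               # full: drop or retry upstream
                self.buf[self.tail] = item
                self.tail = nxt                # publish after the slot is set
                return True

            def pop(self):                     # consumer side, never blocks
                if self.head == self.tail:
                    return None                # empty
                item = self.buf[self.head]
                self.head = (self.head + 1) % self.size
                return item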

    > what support in the linux kernel is there to get the lower priority process to have its priority temporarily increased until it lets go of the message queue on which the higher-priority task is critically dependent

    You want to avoid this. Use lock-free structures for correctness - or you may discover that having the realtime receiver thread do the post is "good enough" for your message volumes.

    > to the best of my knowledge the linux kernel has absolutely no support for these kinds of very important re-prioritisation requirements

    No offense, but Linux has support for this kind of scenario; you're just a little confused about how you go about it. Priority inversion means you don't want to do it this way on _any_ operating system, not just Linux.
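
    For the record, the kernel facility being asked about is the PI futex, surfaced in userspace as PTHREAD_PRIO_INHERIT mutexes. CPython doesn't wrap it, so this sketch reaches through ctypes; the constant and the buffer sizes are glibc/x86-64 assumptions:

        import ctypes

        libc = ctypes.CDLL(None, use_errno=True)
        PTHREAD_PRIO_INHERIT = 1                 # glibc value

        attr = ctypes.create_string_buffer(16)   # >= sizeof(pthread_mutexattr_t)
        mutex = ctypes.create_string_buffer(64)  # >= sizeof(pthread_mutex_t)

        libc.pthread_mutexattr_init(attr)
        libc.pthread_mutexattr_setprotocol(attr, PTHREAD_PRIO_INHERIT)
        libc.pthread_mutex_init(mutex, attr)

        # a low-priority thread holding this mutex gets boosted to the
        # priority of the highest-priority thread blocked on it -- the
        # "temporary automatic re-prioritisation" asked about above
        libc.pthread_mutex_lock(mutex)
        libc.pthread_mutex_unlock(mutex)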

  • by davecb ( 6526 ) <davecb@spamcop.net> on Sunday July 20, 2014 @01:51PM (#47495245) Homepage Journal
    The only thing mainframes have that Unix/Linux Resource Managers lack is "goal mode". I can't set a TPS target and have resources automatically allocated to stay at or above the target. I *can* create minimum guarantees for CPU, memory and I/O bandwidth on Linux, BSD and the Unixes. I just have to manage the performance myself, by changing the minimums.
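
    A sketch of that manage-it-yourself loop against a cgroup CPU quota (the metric hook and the group path are hypothetical stand-ins):

        import time

        CG_CPU_MAX = "/sys/fs/cgroup/app/cpu.max"  # hypothetical group
        PERIOD = 100_000                           # microseconds
        quota = 200_000                            # start at 2 CPUs' worth
        GOAL_TPS = 5_000

        def get_current_tps():
            return 4_200                           # stand-in: wire to real metrics

        while True:
            if get_current_tps() < GOAL_TPS:
                quota = min(int(quota * 1.10), 16 * PERIOD)  # grant more CPU
            else:
                quota = max(int(quota * 0.95), PERIOD // 2)  # reclaim slack
            with open(CG_CPU_MAX, "w") as f:
                f.write(f"{quota} {PERIOD}")
            time.sleep(5)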

"Engineering without management is art." -- Jeff Johnson

Working...