
Linux Needs Resource Management For Complex Workloads

Posted by Soulskill
from the dirty-job-but-somebody's-gotta-do-it dept.
storagedude writes: Resource management and allocation for complex workloads has been a need for some time in open systems, but no one has ever followed through on making open systems look and behave like an IBM mainframe, writes Henry Newman at Enterprise Storage Forum. Throwing more hardware at the problem is a costly solution that won't work forever, he notes.

Newman writes: "With next-generation technology like non-volatile memories and PCIe SSDs, there are going to be more resources in addition to the CPU that need to be scheduled to make sure everything fits in memory and does not overflow. I think the time has come for Linux – and likely other operating systems – to develop a more robust framework that can address the needs of future hardware and meet the requirements for scheduling resources. This framework is not going to be easy to develop, but it is needed by everything from databases and MapReduce to simple web queries."

  • by dbIII (701233) on Sunday July 20, 2014 @01:40AM (#47492651)

    "next-generation technology like non-volatile memories and PCIe SSDs"

    That generation has been going on for a while storagedude. People have been scaling according to load to deal with it.

    • by Anonymous Coward

      "next-generation technology like non-volatile memories and PCIe SSDs"

      That generation has been going on for a while storagedude. People have been scaling according to load to deal with it.

      He just woke up from a coma you insensitive clod.

    • Uh, no. PCIe SSDs are just coming into regular use in many places, and I haven't even heard of non-volatile memories being on the market (GB-sized, mind you - not tiny FRAMs for embedded applications).
      • Fusion-io's ioDrive has been around since 2007. It's been in regular use for those who need it - like 4k video editing.
        The original 7 year old drive is still faster than any SATA SSD you can find today.

        • That's the former, not the latter, but OK. (I also said "in many places", one would have thought it obvious that these things sort of trickle down from the top over time, especially given the initial limitations on the technology.)
        • by swb (14022)

          Yeah, but how many people were editing 4k video in 2007? I'm sure the 3 people at the time weren't worrying about scheduling their Fusion ioDrives across workloads, either, just pounding them into submission. Wider adoption usually means mixed workloads where scheduling scarce resources matters more and is more complicated.

          FWIW I don't know if I agree with the article premise -- it seems like most of these resource scheduling decisions/monitoring/adjustments are being made in hypervisors now (think VMware

      • by dbIII (701233)

        Uh, no. PCIe SSDs are just coming into regular use in many places

        OCZ seem to have been selling them via retail outlets for three years or more - let alone high end use.
        There were various PCI things before the PCIe interface came into use.

      • by EETech1 (1179269)

        IBM has DIMMs with flash memory already.

        www-03.ibm.com/systems/x/options/storage/solidstate/exflashdimm/

        • That's all fine and dandy, but the technological limitations of Flash memories put this in the "not quite there yet" territory when it comes to non-volatile RAMs. You wouldn't want to put your plain old in-memory data structures into that thing, so we're not quite there yet when it comes to unified memory architectures.
  • by Animats (122034) on Sunday July 20, 2014 @01:53AM (#47492699) Homepage

    That level of control probably belongs at the cluster management level. We need to do less in the OS, not more. For big data centers, images are loaded into virtual machines, network switches are configured to create a software defined network, connections are made between storage servers and compute nodes, and then the job runs. None of this is managed at the single-machine OS level.

    With some VM system like Xen managing the hardware on each machine, the client OS can be minimal. It doesn't need drivers, users, accounts, file systems, etc. If you're running in an Amazon AWS instance, at least 90% of Linux is just dead weight. Job management runs on some other machine that's managing the server farm.

    • Honestly, in MVS (z/OS), it probably makes perfect sense to have this in an OS, especially if you're paying through the nose for the hardware already. But solving it on the VM level surely makes it a huge win for everyone.
    • If you're running in an Amazon AWS instance, at least 90% of Linux is just dead weight

      Which 90% would that be, and in what way would it be dead weight? If you don't mind my asking.

    • by Lennie (16154) on Sunday July 20, 2014 @06:58AM (#47493377) Homepage

      Yes and no.

      No, large (Linux using) companies like Google, Facebook, Twitter have always used some kind of Linux container solution, not virtualization.

      Yes, policy is controlled by the cluster manager.

      But for example Google uses nested CGroups for implementing those policies for controlling resources/priorities on their hosts.

      Virtualization is very inefficient and Docker/Linux containers are a perfect example of how people are starting to see that again:
      https://www.youtube.com/watch?... [youtube.com] / https://www.youtube.com/watch?... [youtube.com]

      Supposedly, CPU utilization on AWS is very low, maybe even only 7%:
      http://huanliu.wordpress.com/2... [wordpress.com]

      The reason is that VMs get allocated resources they never end up using, because the host kernel/hypervisor doesn't know what the VM (kernel) is going to do/need.

      For their own services Google doesn't use VMs, but Google does offer VMs to customers and to control the resources used by VM they run the VM inside a container.

      Here are some talks Google did at DockerCon that mentions some of the details of how they work:
      https://www.youtube.com/watch?... [youtube.com]
      https://www.youtube.com/watch?... [youtube.com]

    • If I understand the situation correctly - and it may be that I don't - this is what projects like Docker and chroot jails (?) were created to handle. You get most of the benefits of virtualization without most of the overhead. In a lot of cases you don't need the features that full virtualization provides over them.
      • Or more established/full featured, openvz, xen pv, lxc, cgroups/namespaces, and friends. I think linux (the kernel) already has the tools necessary to do task prioritization like the article requests.
        • I am familiar with cgroups, not the others. Thanks for letting me know where to continue my research.
  • Linux Cgroups (Score:4, Informative)

    by corychristison (951993) on Sunday July 20, 2014 @02:00AM (#47492711)

    Is this not what Linux Cgroups is for?

    From wikipedia (http://en.m.wikipedia.org/wiki/Cgroups):
    cgroups (abbreviated from control groups) is a Linux kernel feature to limit, account, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups.

    From what I understand, LXC is built on top of Cgroups.

    I understand the article is talking about "mainframe" or "cloud" like build-outs but for the most part, what he is talking about is already coming together with Cgroups.
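
A rough sketch of what limiting a process group through cgroups looks like, assuming the cgroup v2 layout (one directory per group, with `memory.max`, `cpu.weight` and `cgroup.procs` files). The function name, the group name and the choice of limits are illustrative assumptions; on a real system the root would be /sys/fs/cgroup, it needs root privileges, and the controllers must be enabled in the parent group. Passing any scratch directory as the root gives a harmless dry run.

```python
import os


def make_cgroup(root, name, memory_max_bytes, cpu_weight, pids=()):
    """Create a cgroup-v2-style directory under root and write limits into it.

    Hypothetical helper: root is normally /sys/fs/cgroup, but any directory
    works for a dry run that only shows which files get written.
    """
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    settings = {
        "memory.max": str(memory_max_bytes),  # hard memory limit in bytes
        "cpu.weight": str(cpu_weight),        # relative CPU share, 1..10000 (default 100)
    }
    for filename, value in settings.items():
        with open(os.path.join(path, filename), "w") as f:
            f.write(value)
    for pid in pids:
        # On a real cgroup filesystem each write moves one PID into the group.
        with open(os.path.join(path, "cgroup.procs"), "w") as f:
            f.write(str(pid))
    return path
```

Nesting groups is just nesting directories, which is roughly what the "nested CGroups" mentioned elsewhere in this thread refers to.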

    • Re: (Score:2, Informative)

      by Anonymous Coward

      the article is not about "mainframe" or "cloud"... it is "advertising" for IBM... a company in the middle of multi-billion dollar deals with apple, all the while fighting to remain even slightly relevant.

      IBM has the magic solution to finally allow the world to run simple web queries.

      FUCK OFF

    • by davecb (6526) <davec-b@rogers.com> on Sunday July 20, 2014 @12:51PM (#47495245) Homepage Journal
      The only thing mainframes have that Unix/Linux Resource Managers lack is "goal mode". I can't set a TPS target and have resources automatically allocated to stay at or above the target. I *can* create minimum guarantees for CPU, memory and I/O bandwidth on Linux, BSD and the Unixes. I just have to manage the performance myself, by changing the minimums.
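
To make the "goal mode" gap concrete, here is a minimal sketch of the loop an administrator ends up running by hand: measure throughput, compare it with the TPS target, and nudge the minimum guarantee. Everything here (the function name, the 5-point step, the 10% headroom band, the floor and ceiling) is an assumption for illustration, not how any mainframe workload manager actually works.

```python
def adjust_cpu_minimum(current_min, measured_tps, target_tps,
                       step=5, floor=5, ceiling=95):
    """Return a new CPU minimum guarantee (in percent) nudged toward the TPS goal."""
    if measured_tps < target_tps:
        # Below the goal: raise the guarantee, but never past the ceiling.
        return min(ceiling, current_min + step)
    if measured_tps > target_tps * 1.1:
        # More than 10% above the goal: hand some capacity back to other work.
        return max(floor, current_min - step)
    # Within the headroom band: leave the guarantee alone.
    return current_min
```

A caller would run this once per measurement interval and write the result into whatever mechanism holds the guarantee, for example a cgroup setting.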
  • KVM, Xen and other hypervisors make Linux systems look like IBM mainframes. The whole "Virtual Machine" hype where we have guest operating systems running on hypervisors is just like IBMs Z series.
    • by Anonymous Coward

      KVM, Xen and other hypervisors make Linux systems look like IBM mainframes. The whole "Virtual Machine" hype where we have guest operating systems running on hypervisors is just like IBMs Z series.

      IBM had the System Resource Manager back in the 1980's when the "zOS" was still OS/MVS.

      More recently, Solaris had resource tuning features, although in my experience, people were preferring throwing cheap hardware at resource consumption over having tuning specialists or tuning-aware system operations.

      The recent addition of cgroups to Linux means that it also has the potential to become tunable in terms of business goals, but again the question is, are people going to pay for the required expertise or are t

  • This feature was introduced in Windows Vista, and as we all know, this is the best OS ever because of that. Can't wait until Linux becomes more like Vista.
  • by m00sh (2538182) on Sunday July 20, 2014 @02:40AM (#47492781)

    I read the article and I can't tell if this is a real problem that is really affecting thousands of users and companies, or a fantasy that the author wrote up in 30 minutes after having a discussion with an old IBM engineer.

    Sure, IBM has all this resource prioritization in mainframes because mainframes cost a lot of money. Nowadays, hardware is so cheap you don't have to do all that stuff.

    If some young programmer undertook the challenge and created the framework, would anyone use it and test it? Will there be an actual need for something like this?

    My question is whether this is inside information about what is really going on at the cutting edge of Linux usage, or just some smoke being blown around for an obligatory write-up.

    • by Kjella (173770)

      These resources are all being managed today, there already are priorities for CPU, QoS for network bandwidth, ionice and quotas for storage and so on with a lot of specialization in each. He wants to build some kind of comprehensive resource management framework where everything from CPU time, memory, storage, network bandwidth etc. is being prioritized. It sounds extremely academic to me, particularly when I read the line:

      I will make the assumption that everything at every level is monitored and tracked (...)

      Besides, resource management isn't something that happens only on this level, for exa

      • by Anonymous Coward

        Ha, your SQL server scenario is similar to one I've heard from IBM engineers (and IBM fellows) but with a priority inversion twist that requires SLAs and monitoring. That periodic consolidated report can become a nightmare when it finally grows to take longer than one period to complete! Enterprises come crashing down when these overlooked/implied invariants get violated. Eventually, increasing the job priority won't even work because it will squeeze out all the line of business workload, and what you re

    • by Anonymous Coward

      Nowadays, hardware is so cheap you don't have to do all that stuff.

      Instead of spending a bit of those resources to allocate the rest with good efficiency, the standing assumption is that resources are effectively free anyway and so wasting them with gay abandon is worth it. This is the assumption, but it's not really true.

      At sufficient scale even the smallest cost becomes non-negligible. This isn't just for the few of us who write "truly web-scale" or whatever the term is today. Even in something as simple as an end-user application like, oh, a video player, "saving" progr

  • by Anonymous Coward

    I thought the title wanted to talk about something revolutionary, so I read through the details.

    What I discovered was that the title was bullshit, so were the concerns surrounding Linux's capabilities. Some of them make sense for general all-purpose computation, some of them don't. I don't see why anybody should take these proposals too seriously for kernel inclusions.

    The portion on primary memory management is perfect. Hadoop does suffer from lack of cache aware code; So far, only modified kernels have bee

  • by lkcl (517947) <lkcl@lkcl.net> on Sunday July 20, 2014 @03:40AM (#47492919) Homepage

    i am running into exactly this problem on my current contract. here is the scenario:

    * UDP traffic (an external requirement that cannot be influenced) comes in
    * the UDP traffic contains multiple data packets (call them "jobs") each of which requires minimal decoding and processing
    * each "job" must be farmed out to *multiple* scripts (for example, 15 is not unreasonable)
    * the responses from each job running on each script must be collated then post-processed.

    so there is a huge fan-out where jobs (approximately 60 bytes) are coming in at a rate of 1,000 to 2,000 per second; those are being multiplied up by a factor of 15 (to 15,000 to 30,000 per second, each taking very little time in and of themselves), and the responses - all 15 to 30 thousand - must be in-order before being post-processed.
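
The arithmetic behind those figures, as a small sketch. The 60-byte job size, the 15 scripts per job and the 1,000 to 2,000 jobs per second come from the description above; the 4 cores are mentioned further down; the per-task time budget is a derived back-of-envelope number that ignores all dispatch overhead.

```python
JOB_BYTES = 60        # approximate job size from the description
SCRIPTS_PER_JOB = 15  # fan-out factor from the description
CORES = 4             # core count mentioned later in the post


def fan_out(jobs_per_second):
    """Return (script invocations per second, job payload in bits per second)."""
    tasks = jobs_per_second * SCRIPTS_PER_JOB
    payload_bits = jobs_per_second * JOB_BYTES * 8
    return tasks, payload_bits


def budget_microseconds(jobs_per_second, cores=CORES):
    """CPU time available per script invocation if all cores did nothing else."""
    tasks, _ = fan_out(jobs_per_second)
    return cores * 1_000_000 / tasks


for rate in (1000, 2000):
    tasks, bits = fan_out(rate)
    print(f"{rate} jobs/s: {tasks} invocations/s, "
          f"{bits / 1e6:.2f} Mbit/s payload, "
          f"{budget_microseconds(rate):.0f} us per invocation")
```

The payload bandwidth is tiny; the pressure comes from the number of invocations, each of which has only a small slice of CPU time before the next one is due.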

    so, the first implementation is in a single process, and we just about achieve the target of 1,000 jobs but only about 10 scripts per job.

    anything _above_ that rate and the UDP buffers overflow and there is no way to know if the data has been dropped. the data is *not* repeated, and there is no back-communication channel.

    the second implementation uses a parallel dispatcher. i went through half a dozen different implementations.

    the first ones used threads, semaphores through python's multiprocessing.Pipe implementation. the performance was beyond dreadful, it was deeply alarming. after a few seconds performance would drop to zero. strace investigations showed that at heavy load the OS call futex was maxed out near 100%.

    next came replacement of multiprocessing.Pipe with unix socket pairs and threads with processes, so as to regain proper control over signals, sending of data and so on. early variants of that would run absolutely fine up to some arbitrary limit then performance would plummet to around 1% or less, sometimes remaining there and sometimes recovering.

    next came replacement of select with epoll, and the addition of edge-triggered events. after considerable bug-fixing a reliable implementation was created. testing began, and the CPU load slowly cranked up towards the maximum possible across all 4 cores.
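
For readers unfamiliar with the edge-triggered pattern: with EPOLLET the kernel reports readiness once per transition, so the reader has to drain the socket until it would block or it will miss queued data. The sketch below is an assumption about the general shape, not the code from this project; it uses a datagram socketpair as a stand-in for one dispatcher/worker channel and is Linux-only because `select.epoll` is.

```python
import select
import socket


def drain(sock, bufsize=2048):
    """Read datagrams until the socket would block; required with EPOLLET."""
    messages = []
    while True:
        try:
            messages.append(sock.recv(bufsize))
        except BlockingIOError:
            return messages


def demo():
    # A datagram socketpair stands in for one dispatcher/worker channel.
    tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    rx.setblocking(False)
    ep = select.epoll()
    ep.register(rx.fileno(), select.EPOLLIN | select.EPOLLET)
    try:
        for i in range(3):
            tx.send(b"job-%d" % i)
        received = []
        # The readiness edge is reported once, so everything queued is drained here.
        for fd, events in ep.poll(timeout=1.0):
            if events & select.EPOLLIN:
                received.extend(drain(rx))
        return received
    finally:
        ep.close()
        tx.close()
        rx.close()
```

Forgetting the drain loop is one common way to get the "runs fine, then stalls" behaviour with edge-triggered events, though nothing here confirms that was the cause in this case.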

    the performance metrics came out *WORSE* than the single-process variant. investigations began and showed a number of things:

    1) even though it is 60 bytes per job the pre-processing required to make the decision about which process to send the job to was so great that the dispatcher process was becoming severely overloaded

    2) each process was spending approximately 5 to 10% of its time doing actual work and NINETY PERCENT of its time waiting in epoll for incoming work.

    this is unlike any other "normal" client-server architecture i've ever seen before. it is much more like the mainframe "job processing" that the article describes, and the linux OS simply cannot cope.

    i would have used POSIX shared memory Queues but the implementation sucks: it is not possible to identify the shared memory blocks after they have been created so that they may be deleted. i checked the linux kernel source: there is no "directory listing" function supplied and i have no idea how you would even mount the IPC subsystem in order to list what's been created, anyway.

    i gave serious consideration to using the python LMDB bindings because they provide an easy API on top of memory-mapped shared memory with copy-on-write semantics. early attempts at that gave dreadful performance: i have not investigated fully why that is: it _should_ work extremely well because of the copy-on-write semantics.

    we also gave serious consideration to just taking a file, memory-mapping it and then appending job data to it, then using the mmap'd file for spin-locking to indicate when the job is being processed.

    with all of these crazy implementations, i basically have absolutely no confidence in the linux kernel, nor in the GNU/Linux POSIX-compliant implementation of the OS on top: i have no confidence that it can handle the load.

    so i would be very interested to hear from anyone who has had to design similar architectures, and how they dealt with it.

    • Try putting a load balancer (Cisco ACE, Citrix NetScaler) on a virtual IP and load balancing the UDP packets across several nodes behind the balancer.

      • by Bengie (1121981)
        He said the CPU is mostly idle. He's trying to set up his system to handle lots of tiny tasks and Linux isn't playing well with the regular tools.
    • by Mr Thinly Sliced (73041) on Sunday July 20, 2014 @05:44AM (#47493195) Homepage Journal

      > the first ones used threads, semaphores through python's multiprocessing.Pipe implementation.

      I stopped reading when I came across this.

      Honestly - why are people trying to do things that need guarantees with python?

      The fact you have strict timing guarantees means you should be using a realtime kernel and realtime threads with a dedicated network card and dedicated processes on IRQs for that card.

      Taking the incoming messages from UDP and posting them on a message bus should be step one so that you don't lose them.

      • by lkcl (517947) <lkcl@lkcl.net> on Sunday July 20, 2014 @06:51AM (#47493359) Homepage

        > the first ones used threads, semaphores through python's multiprocessing.Pipe implementation.

        I stopped reading when I came across this.

        Honestly - why are people trying to do things that need guarantees with python?

        because we have an extremely limited amount of time as an additional requirement, and we can always rewrite critical portions or later the entire application in c once we have delivered a working system that means that the client can get some money in and can therefore stay in business.

        also i worked with david and we benchmarked python-lmdb after adding in support for looped sequential "append" mode and got a staggering performance metric of 900,000 100-byte key/value pairs, and a sequential read performance of 2.5 MILLION records. the equivalent c benchmark is only around double those numbers. we don't *need* the dramatic performance increase that c would bring if right now, at this exact phase of the project, we are targeting something that is 1/10th to 1/5th the performance of c.

        so if we want to provide the client with a product *at all*, we go with python.

        but one thing that i haven't pointed out is that i am an experienced linux python and c programmer, having been the lead developer of samba tng back from 1997 to 2000. i simply transferred all of the tricks that i know involving while-loops around non-blocking sockets and so on over to python. ... and none of them helped. if you get 0.5% of the required performance in python, it's so far off the mark that you know something is drastically wrong. converting the exact same program to c is not going to help.

        The fact you have strict timing guarantees means you should be using a realtime kernel and realtime threads with a dedicated network card and dedicated processes on IRQs for that card.

        we don't have anything like that [strict timing guarantees] - not for the data itself. the data comes in on a 15 second delay (from the external source that we do not have control over) so a few extra seconds delay is not going to hurt.

        so although we need the real-time response to handle the incoming data, we _don't_ need the real-time capability beyond that point.

        Taking the incoming messages from UDP and posting them on a message bus should be step one so that you don't lose them.

        .... you know, i think this is extremely sensible advice (which i have heard from other sources) so it is good to have that confirmed... my concerns are as follows:

        questions:

        * how do you then ensure that the process receiving the incoming UDP messages is high enough priority to make sure that the packets are definitely, definitely received?

        * what support from the linux kernel is there to ensure that this happens?

        * is there a system call which makes sure that data received on a UDP socket *guarantees* that the process receiving it is woken up as an absolute priority over and above all else?

        * the message queue destination has to have locking otherwise it will be corrupted. what happens if the message queue that you wish to send the UDP packet to is locked by a *lower* priority process?

        * what support in the linux kernel is there to get the lower priority process to have its priority temporarily increased until it lets go of the message queue on which the higher-priority task is critically dependent?

        this is exactly the kind of thing that is entirely missing from the linux kernel. temporary automatic re-prioritisation was something that was added to solaris by sun microsystems quite some time ago.

        to the best of my knowledge the linux kernel has absolutely no support for these kinds of very important re-prioritisation requirements.

        • by Mr Thinly Sliced (73041) on Sunday July 20, 2014 @07:25AM (#47493435) Homepage Journal

          First - the problem with python is that because it's a VM you've got a whole lot of baggage in that process out of your control (mutexes, mallocs, stalls for housekeeping).

          Basically you've got a strict timing guarantee dictated by the fact that you have incoming UDP packets you can't afford to drop.

          As such, you need a process sat on that incoming socket that doesn't block and can't be interrupted.

          The way you do that is to use a realtime kernel and dedicate a CPU using process affinity to a realtime receiver thread. Make sure that the only IRQ interrupt mapped to that CPU is the dedicated network card. (Note: I say realtime receiver thread, but in fact it's just a high priority callback down stack from the IRQ interrupt).
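
A minimal sketch of the process-side half of that advice, using calls from Python's `os` module that wrap the Linux scheduler interface. The function name, the CPU number and the priority value are assumptions for illustration. Routing the network card's interrupt to the same CPU is a separate step that is not shown, and SCHED_FIFO normally needs root or CAP_SYS_NICE, so the sketch reports failure instead of raising.

```python
import os


def pin_and_prioritise(cpu, rt_priority=50):
    """Pin the calling process to one CPU and request the SCHED_FIFO policy.

    Returns True if the realtime policy was granted, False if the process
    lacks the privilege. The affinity change is applied in either case.
    """
    os.sched_setaffinity(0, {cpu})  # 0 means "the calling process"
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(rt_priority))
        return True
    except PermissionError:
        return False
```

This only keeps the receiver on its own CPU; keeping everything else off that CPU takes additional system configuration.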

          This realtime receiver thread should be a "complete" realtime thread - no malloc, no mutexes. Passing messages out of these realtime threads should be done via non-blocking ring buffers to high (regular) priority threads who are in charge of posting to something like zeromq.

          Depending on your deadlines, you can make it fully non-blocking but you'll need to dedicate a CPU to spin lock checking that ring buffer for new messages. Second option is that you calculate your upper bound on ring buffer fill and poll it every now and then. You can use semaphores to signal between the threads but you'll need to make that other thread realtime too to avoid a possible priority inversion situation.
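
The ring buffer idea can be sketched as follows. This is a structural illustration only: a single-producer/single-consumer ring where each index is written by exactly one side and neither operation ever waits. In Python it does not give real lock-free or realtime guarantees (the interpreter has its own locking), and the class and method names are made up for the example; a production version would be written in C with proper memory barriers.

```python
class RingBuffer:
    """Fixed-size single-producer/single-consumer ring; never blocks."""

    def __init__(self, capacity):
        # One slot is kept empty so that "full" and "empty" are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; only the consumer writes this
        self._tail = 0  # next slot to write; only the producer writes this

    def try_push(self, item):
        """Store item and return True, or return False immediately if full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # full: the caller decides whether to drop or count it
        self._slots[self._tail] = item
        self._tail = nxt
        return True

    def try_pop(self):
        """Return the oldest item, or None immediately if the ring is empty."""
        if self._head == self._tail:
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return item
```

Because `try_pop` uses None to signal "empty", None itself cannot be stored as an item in this sketch.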

          > how do you then ensure that the process receiving the incoming UDP messages is high enough priority to make sure that the packets are definitely, definitely received

          As mentioned, dedicate a CPU, mask everything else off from it, and make the IRQ point to it.

          > what support from the linux kernel is there to ensure that this happens

          With a realtime thread the only other thing that could interrupt it would be another realtime priority thread - but you should make sure that situation doesn't occur.

          > is there a system call which makes sure that data received on a UDP socket *guarantees* that the process receiving it is woken up as an absolute priority over and above all else

          Yes, IRQ mapping to the dedicated CPU with a realtime receiver thread.

          > the message queue destination has to have locking otherwise it will be corrupted. what happens if the message queue that you wish to send the UDP packet to is locked by a *lower* priority process

          You might get away with having the realtime receiver thread do the zeromq message push (for example) but the "real" way to do this would be lock-free ring buffers and another thread being the consumer of that.

          > what support in the linux kernel is there to get the lower priority process to have its priority temporarily increased until it lets go of the message queue on which the higher-priority task is critically dependent

          You want to avoid this. Use lockfree structures for correctness - or you may discover that having the realtime receiver thread do the post is "good enough" for your message volumes.

          > to the best of my knowledge the linux kernel has absolutely no support for these kinds of very important re-prioritisation requirements

          No offense, but Linux has support for this kind of scenario, you're just a little confused about how you go about it. Priority inversion means you don't want to do it this way on _any_ operating system, not just Linux.

          • by lkcl (517947)

            hi mr thinly-sliced, thank you this is awesome advice, really really appreciated.

            • You're welcome - I hope you get it sorted out.

              The only other thing I'd mention - you perhaps noticed I kept saying "threads like.." and "with regular threads" because it's basically introduced a number of single points of failure. Due to the lack of back channel or retransmission, things can go silently wrong without notice (network cable failure etc). In an ideal world you'd double up on some of that infrastructure and networking.

              I know you need to get something up and running, but it's perhaps something t

        • You should look up mutex attributes, in particular priority inheritance. Also, I think you are experiencing the "thundering herd" effect. Maybe the leader/follower pattern could be effective here.
        • Given this problem, there are several options for fanout... I'm assuming that hardware can be added, so adding a load balancer and then three or four machines to cope with the load behind the load balancer might be the quickest (least code change) way to address the issue. Especially if there is no global state needed, this is likely the most expedient.

          An option that might be a bit more flexible on a single box, while still scalable, would be to have a task that parses each incoming job and posts it to a

        • by sjames (1099)

          You'll need a bit of C, but consider using sched_setscheduler on the receiver process to make sure you get the packets before the buffer fills. That process can have a big buffer and keep a queue stuffed for the actual handling. Probably one thread to receive and one to stuff the queue will work.
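
The shape of that receiver can be sketched in Python even though the suggestion is to write it in C: one thread does nothing but recv() and enqueue, so the socket buffer is emptied as fast as possible while slower workers read from the queue. The function name and the 0.1 second wake-up interval are assumptions made for the example.

```python
import queue
import socket
import threading


def start_receiver(sock, out_queue, stop_event, bufsize=2048):
    """Start a thread that only receives datagrams and puts them on out_queue."""
    def loop():
        sock.settimeout(0.1)  # wake up periodically to check stop_event
        while not stop_event.is_set():
            try:
                out_queue.put(sock.recv(bufsize))
            except socket.timeout:
                continue

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

An unbounded `queue.Queue` never blocks the receiver on `put`, at the cost of growing memory if the workers fall behind for long.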

          The worker processes can remain as python processes at that point. As long as your queue is lossless and the workers are on average fast enough AND their jitter is smaller than your buffer in the high performance C

      • by Alef (605149)

        Honestly - why are people trying to do things that need guarantees with python?

        Oh, you got that far at least? What I wonder is, why are people trying to do things that need guarantees using UDP with no back-communication, no redundancy built in to the protocol, and not even detection of lost packets? External requirement my ass, why do you accept a contract under those conditions? The correct thing to say is "this is broken, and it's not going to work". If they still want the turd polished, it should be und

        • FWIW I agree vis-a-vis using UDP for a business critical thing. I'd want exemption from responsibility for any missed packets purely due to the infrastructure in between.

        • by hyc (241590)

          Totally agreed. The lack of guarantees re: UDP is built into the UDP spec, it's not a failing of the Linux kernel (nor any other OS) that it won't tell you about dropped packets. Luke, you should know better than this.

          • by Bengie (1121981)
            But if done correctly, you can do line rate UDP with 0% loss. Routers can do line rate without loss all the time. He's talking about thousands of packets per second, not the millions to tens of millions a modern NIC can handle.
            • by Alef (605149)

              Of course it's technically possible to transmit packets with essentially 0% loss, and I'm sure there are set-ups that would work under the right circumstances. That's not the point. The point is that each and every component involved, from hardware through firmware to software, is designed under the premiss that it is okay to drop a packet at any time for any reason, or to duplicate or reorder packets. Even if you get it to work, the replacement of any single component, or the triggering of some corner case

              • by Bengie (1121981)

                The point is that each and every component involved, from hardware through firmware to software, is designed under the premiss that it is okay to drop a packet at any time for any reason, or to duplicate or reorder packets.

                That entire sentence is damn near a lie. Those issues can happen, but they shouldn't happen. You almost have to go out of your way to make those situations happen. Dropping a packet should NEVER happen except when going past line rate. Packets should NEVER be duplicated or reordered except in the case of a misconfiguration of a network. Networks are FIFO and they don't just duplicate packets for the fun of it.

                As for error rates, many high end network devices can upwards of an error rate of 10E-18, which pu

        • by awol (98751)

          Absolutely. Soooo doomed. You cannot guarantee that the UDP packets even get across the wire to your NIC, so what difference does it make whether your software gets them all out of the NIC

          • by Bengie (1121981)
            What kind of crappy network equipment does your job use that has packet loss at anything less than line rate? He's talking about near 1mbit/sec of UDP. I can get 0% packet-loss around the world for only 1mb/s
      • by BitZtream (692029)

        Honestly - why are people trying to do things that need guarantees with python?

        Because they don't actually know how to do what they are claiming the requirements are and they refuse to turn it over to someone who does.

        I'd have thought that was pretty clear. Trying to do real time work in python made it clear to me.

    • by Gothmolly (148874)

      a) Your UDP buffers probably suck. OOB RedHat gives you 128K, and each packet takes up 2304 bytes of buffer space. Try 100MB, or whatever YOUR_RATE/2304 works out to.
      b) Pull off the queue and buffer in RAM as fast as you can
      c) Have a second thread read from RAM
      d) Don't invoke scripts to process each packet, you're spinning all your time in process creation. In fact, don't use interpreted scripts at all.
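
A sketch of point (a), assuming a plain UDP socket. The 2304 bytes per packet is the figure quoted above and is taken on trust; reading the sizing rule as "packets per second times seconds of slack times per-packet cost" is an interpretation, and both function names are made up. On Linux the kernel caps the request at net.core.rmem_max and reports back double the value it stored, so the granted size should always be read back and checked.

```python
import socket


def buffer_for(packets_per_second, seconds_of_slack, per_packet_bytes=2304):
    """Bytes of socket buffer needed to ride out a stall of the given length."""
    return int(packets_per_second * seconds_of_slack * per_packet_bytes)


def open_udp_receiver(port, wanted_bytes, host="0.0.0.0"):
    """Open a UDP socket, request a receive buffer, and report what was granted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, wanted_bytes)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    sock.bind((host, port))
    return sock, granted
```

If the granted size comes back far below the request, the system-wide limit has to be raised by an administrator before a larger request will take effect.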

    • by raxx7 (205260)

      Interesting. It sounds a bit like an application I have.
      Like yours, it involves UDP and Python.
      I have 150.000 "jobs" per second arriving in UDP packets. "Job" data can be between 10 and 1400 bytes and as many "jobs" are packed into each UDP packet as possible.

      I use Python because, intermixed with the high performance job processing, I also mix slow but complex control sequences (and I'd rather cut my wrists than move all that to C/C++).
      But to achieve good performance, I had to reduce Python's contribution to

    • by Greyfox (87712)
      Could you put multiple network cards on your scheduler machine, put the workers on different subnets and randomly dole out the jobs between those subnets? Seems like you'd be less likely to drop UDP packets that way. I'm pretty sure I ran across a utility (lsipc or something) that would list IPC resources, including shared memory. I seem to recall that the segments also show up in /proc somewhere. It's been a while since I've looked at it.

      Not being able to ack important message packets seems like a design

    • Consider trying QNX, the message-passing real time OS, for this. This is a message passing problem, and Linux doesn't do message passing well. QNX has a scheduler optimized for message passing. You should be able to handle the UDP front end and fan-out without any problems. You can give the front-end process a higher priority than the other processes, which should let you get all the UDP packets into the fan-out program without losing any. That's what real-time OSs are for.

      Trying to do anything high-per

    • by RelliK (4466)

      > the first ones used threads, semaphores through python's multiprocessing.Pipe implementation. the performance was beyond dreadful, it was deeply alarming. after a few seconds performance would drop to zero. strace investigations showed that at heavy load the OS call futex was maxed out near 100%.

      uhhm... wait what?

      You are aware that python has a global interpreter lock [python.org], right? And because of that, multi-threaded performance for CPU-bound work in python is actually *worse* [dabeaz.com] than single-threaded? But this is an inherent flaw in

    • by Bengie (1121981)
      When you're handling lots of little messages/jobs/tasks that are coming in quickly, passing data between processes is a horrible idea. Between context switching and system calls, you're destroying your performance.

      You need to make larger batches.

      1) UDP/Job comes in, write it to a single-writer, many-reader queue (large circular queues can be good for this) along with an order number, maybe a 64-bit incrementing integer. If the run time per job is quite constant, then you could use several single reader/writer queues a
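
      The batching idea above can be sketched like this. It is a simplified, single-threaded stand-in and not the circular queue the comment describes: the class name and methods are hypothetical, and a real version would need a bounded buffer and synchronization between the writer and its readers.

```python
import itertools
from collections import deque


class BatchingQueue:
    """Tag each job with an incrementing order number and hand out batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._seq = itertools.count()  # the incrementing order number
        self._pending = deque()

    def put(self, job):
        """Writer side: store the job together with its order number."""
        self._pending.append((next(self._seq), job))

    def take_batch(self):
        """Reader side: remove and return up to batch_size jobs at once."""
        n = min(self.batch_size, len(self._pending))
        return [self._pending.popleft() for _ in range(n)]
```

      Handing a worker a whole batch per call means one hand-off per batch instead of one per job, which is where the savings in context switches and system calls would come from.
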
  • Weren't they added in Linux 0.01 around 1991?

  • There is a solution that does this: it's called a mainframe. They're hideously expensive. Cooked a motherboard recently? 1.2 million. Want a 10G network card? $20000. You can buy an awful lot of commodity hardware for much less, so that you have excess resources. Need a dedicated system for a database? Buy one and run the other applications on a shared resource; you'll still end up with spare change if you dump a mainframe contract. You can replace a mainframe with commodity items, you just need to plan for it

    • by Bengie (1121981)
      The whole point of QoS is to not have to add more hardware, but to make better use of your current hardware while not having large amounts of jitter. Mainframes don't need to worry about interactive processes, but many modern-day workloads do. What they want is a good average throughput with a maximum latency.
  • by Cyberax (705495) on Sunday July 20, 2014 @06:20AM (#47493285)
    Really. Author is an idiot. He should actually read something that is not a documentation volume for his beloved IBM mainframe.

    Linux has cgroups support, which allows partitioning a machine into multiple hierarchical containers. Memory and CPU partitioning works well, so it's easy to give only a certain percentage of CPU, RAM and/or swap to a specific set of tasks. Direct disk IO is getting in shape.

    Lots of people are using cgroups in production on very large scales. There are still some gaps and inconsistencies around the edges (for example, buffered IO bandwidth can't be metered) but kernel developers are working on fixing them.
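
    As a rough illustration of that kind of CPU and memory partitioning, here is a minimal sketch that assumes the cgroup v2 unified hierarchy and its cpu.max, memory.max and cgroup.procs files, an interface newer than the one available when this comment was posted. The helper names and the example path are hypothetical. On a real system this needs root and the cpu and memory controllers enabled for the parent group.

```python
import os


def cpu_max_value(cpu_percent, period_us=100_000):
    """Return the "quota period" string that cgroup v2 expects in cpu.max."""
    quota_us = int(period_us * cpu_percent / 100)
    return f"{quota_us} {period_us}"


def limit_cgroup(cgroup_dir, cpu_percent, memory_bytes, pids=()):
    """Create a cgroup directory, set CPU and memory limits, and add PIDs."""
    os.makedirs(cgroup_dir, exist_ok=True)
    with open(os.path.join(cgroup_dir, "cpu.max"), "w") as f:
        f.write(cpu_max_value(cpu_percent))
    with open(os.path.join(cgroup_dir, "memory.max"), "w") as f:
        f.write(str(memory_bytes))
    for pid in pids:
        # cgroup.procs takes one PID per write
        with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
            f.write(str(pid))
```

    For example, limit_cgroup("/sys/fs/cgroup/batch", 25, 2 * 1024**3, [os.getpid()]) would cap a hypothetical "batch" group at a quarter of one CPU and 2GB of RAM and move the calling process into it.
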
  • Moore's Law speaks to computational horsepower per unit per cost. But even if the computational abilities do not continue to increase, the costs will keep coming down.

    Hardware is cheap. It's not an elegant solution, but it's cheap. And getting cheaper.

    Focus on the UX, because without that, who cares what your kernel can do? Machines are plenty powerful enough, what you want to do is get your OS into the hands of the most users possible .... right?

    • by Jeremi (14640)

      Hardware is cheap. It's not an elegant solution, but it's cheap. And getting cheaper.

      Right, but if your company comes up with an elegant solution that gets 10x better performance out of a given piece of hardware, and your competitors cannot (or do not) do the same, then you've got a cost advantage over your competitors and can use that to get customers to choose to buy your product rather than theirs.

      That will always be true, no matter how fast and cheap the hardware gets. Either your customers will be able to do 10 times more work with your product, or (if there isn't 10 times more work t

  • I don't have hard data yet, but I'm finding that EL7 is much much faster than EL6 on the same hardware for the workloads I've tried so far.

    I don't know that tuned [fedorahosted.org] is most responsible, but I can see that it's running and that's what it's supposed to do.

    I realize that the kernel is better and perhaps XFS helps, but those alone seem insufficient to realize the difference.

    Anyway, it's somewhat along the direction people are talking about, even if only minimally.

  • ... but no one has ever followed through on making open systems look and behave like an IBM mainframe, ...

    But I'll need a punch-card station and reader, build out my server room with a glass service window, hire a disinterested, snarky guy to retrieve printouts ... Or have IBM mainframes changed since my college days back in the late '80s?

  • I think this person is still mad that linux doesn't feed out accurate memory usage ever since COW pages were introduced, let alone multiple efficiency steps since then.

    Not going to say that task management over the bigger picture is a bad idea, but you have to make it more coarse (per server, approximations) rather than fine-grained if you still want to use many of Linux's performance improvements over IBM mainframe approaches. Mind, I've built a couple of systems like that for proprietary infrastr
