Linux 2.6 Multithreading Advances
chromatic writes "Jerry Cooperstein has just written an excellent article explaining the competing threading implementations in the upcoming 2.6 kernel for the O'Reilly Network."
Non-threaded programs (Score:2, Insightful)
The worst example of this was the Quake I source code, which was used for many games, including Half-Life. The code was not multi-threaded, and the network code sat idle while everything else drew -- adding about 20ms of lag, unless you cut the frame rate down to about 15 or so.
The problem wasn't fixed in Half-Life -- the most popular multiplayer game of all time -- until sometime in 2000. We can only imagine how many other programs are not taking full advantage of multithreading.
Actually, Quake II (Score:1)
Re:Actually, Quake II (Score:3, Informative)
Quake's original networking was meant for LANs only - the fact that it was even barely playable over the internet surprised the authors.
id Software soon released QuakeWorld free to Quake owners. It used the same interface and most of the graphics resources as Quake, so it's arguably not a different program. But it came as a separate executable, with many Quake features removed (like monsters). And most importantly, the networking code was entirely re-written.
It is that code that Quake II and its successors derived from.
WRONG!!! Half-Life was based, mostly, off of Quake (Score:5, Interesting)
Half-Life was mostly based off Quake1. The network protocol and prediction code was taken from QuakeWorld. Some small Quake2 functionality was merged later on.
The initial release of Half-Life was approximately 65% Quake1, 15% QuakeWorld, 5% Quake2 and 15% original (not including the bastardisation of the code into MFC/C++).
And yes, people from Valve have confirmed the base was Quake1, not (as some people continue to claim, and I really wish I knew where the rumor started) Quake2.
Also, the percentages are based off some reverse engineering work I did a while ago when I was playing with making a compatible Linux clone of Half-Life.
(FYI, I took the Quake1 engine... added Half-Life map loading and rendering within about three hours... Half-Life model support took about four days, and adding a mini-WINE DLL loader for the gameplay code took about a week. I gave up on the project when it came down to having to keep it up-to-date with Valve's patches)
- Ender
Founder, http://www.quakesrc.org/
Project Leader, http://www.scummvm.org/
Tenebrae! (Score:1)
Half-Life (or better yet, Counter-Strike) with Doom III graphics would just own.
http://tenebrae.sourceforge.net [sourceforge.net]
Re:Tenebrae! (Score:2)
Besides which, Tenebrae's code is just a proof of concept, and not remotely suitable for a real workable port.
That, and I'm too busy with ScummVM to help anyone implement this stuff, it's very disjointed and hacky. And I'm already revered, thanks
(If anyone does want to play with this type of thing themselves, check the Forums section of my QuakeSrc site. Both the Half-Life map rendering code and an older version of my Half-Life MDL renderer are floating around in several engines. I'm sure someone on the forums would be happy to help you.
You're on your own on the network protocol and merging the WINE stuff, though. I did this port some years ago, and I don't have much of the source left)
- Ender
Founder, http://www.quakesrc.org/
Project Leader, http://www.scummvm.org/
Re:Tenebrae! (Score:2)
I really think that a Counter Strike realism factor could rock, but what about the new Nvidia graphics language everyone keeps talking about? Maybe it won't be hard for us to make leaps toward a fast net CS game with super cool graphics. Maybe the new code would enable the CS team to split off from the Half Life engine and do their own thing?
If you look at the stats on Gamespy [gamespy.com], Counterstrike takes the biggest share of online use. Take that and think about Valve and how much money they made on the backs of that project!
*shudder*
To clear up my point, I think that a glossy Doom3 version of CS might be immersive, but it might also lack the networking capability that the current version has. CS today doesn't take up much resources on higher end systems, or even lower ones. Today it's pretty cheap to get CS enabled.
Maybe that plays a factor.
Re:Non-threaded programs (Score:5, Interesting)
Whether or not multithreading will accelerate any particular program has to be determined case-by-case. And for most software, the deciding factor should be whether threads will simplify development and correctness (theoretically they can, but lots of developers don't understand threads and use them wrong).
My company has a realtime networked game for which threading was an impediment. Both the rate and duration of screen refreshes and network transmissions were low enough that they didn't usually interfere with each other in the same thread. But using thread-safe versions of standard library functions was degrading every other part of the program with constant locking/unlocking.
So nonthreaded was faster. (Maybe cleverer people could've made special thread-unsafe alternative functions to use in contexts where we know inter-thread race conditions won't occur. But munging around with 2 standard libraries in one program is riskier than we'd like to deal with)
Re:Non-threaded programs (Score:4, Insightful)
Many coders are disinclined to use threads, because they don't necessarily improve code speed.
Further, there are a number of examples where writing a single-threaded application has definite benefits - for example, applications where deadlocks or race conditions would be an inherent problem in a multithreaded implementation, while a single thread has none of these problems.
Re:Non-threaded programs (Score:5, Informative)
That's a common myth. In fact, there are some kinds of deadlock that do go away, but there are also some kinds that merely change their shape. For example, the need to lock a data structure to guarantee consistent updates goes away, and so do deadlocks related to locking multiple data structures. OTOH, resource-contention deadlocks don't go away. You might still have two "tasks" contending for resources A and B, except that in the non-threaded model the tasks might be chained event handlers for some sort of state machine instead of threads. If task1 tries to get A then B, and task2 tries to get B then A, then task1's "B_READY" and task2's "A_READY" events will never fire and you're still deadlocked. Sure, you can solve it by requiring that resources be taken in order, but you can do that with threads too; the problem's solvable, but isn't solved by some kind of single-threading magic.
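A minimal sketch of that same A-then-B versus B-then-A pattern with POSIX threads - the resource names and the sleep() calls are made up purely to make the window obvious, and the fix noted in the comments is the lock-ordering rule mentioned above:

/* Classic A-then-B vs. B-then-A deadlock, sketched with POSIX threads.
 * Build: gcc -pthread deadlock.c
 * The fix is the one described above: both tasks must agree on a single
 * global acquisition order (always A before B). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t res_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t res_b = PTHREAD_MUTEX_INITIALIZER;

static void *task1(void *arg)
{
    pthread_mutex_lock(&res_a);
    sleep(1);                       /* widen the window so the hang is reproducible */
    pthread_mutex_lock(&res_b);     /* blocks forever once task2 holds B */
    puts("task1 got A then B");
    pthread_mutex_unlock(&res_b);
    pthread_mutex_unlock(&res_a);
    return NULL;
}

static void *task2(void *arg)
{
    pthread_mutex_lock(&res_b);     /* swap these two acquisitions to A-then-B */
    sleep(1);                       /* and the deadlock disappears              */
    pthread_mutex_lock(&res_a);
    puts("task2 got B then A");
    pthread_mutex_unlock(&res_a);
    pthread_mutex_unlock(&res_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task1, NULL);
    pthread_create(&t2, NULL, task2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

The event-handler version deadlocks the same way; only the spelling changes.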
I've written several articles on this topic for my website in the past. In case anyone's interested...
Re:Non-threaded programs (Score:2)
Re:Non-threaded programs (Score:2)
Re:Non-threaded programs (Score:3, Insightful)
This is only a factor with a poor multithreaded design. By contrast, single-threaded programs always fail to take advantage of multiple processors, no matter how well they're designed otherwise.
Re:Non-threaded programs (Score:2)
Re:Non-threaded programs (Score:3, Informative)
A well-designed multi-threaded implementation will organize its thread usage in such a way that under light load and/or on a single processor it will not have significantly more context switches than a single-threaded equivalent. Under such conditions it will exhibit the same performance characteristics as that single-threaded version, and yet it will also be able to take advantage of inherent parallelism and multiple processors when they exist. Been there.
Bad multithreaded implementations schedule so many computationally active threads that TSE switches are inevitable. Bad multithreaded implementations force two context switches per request as work is handed off between "listener" and "worker" threads. Bad multithreaded implementations do lots of stupid things, but not all multithreaded implementations are bad. The main overhead involved in running a well-designed multithreaded program on a uniprocessor is not context switches but locking, and that will be buried in the noise. Done that.
A handful of extra context switches per second and a fraction of a percent of extra locking overhead are a small price to pay for multiprocessor scalability.
Trivial, but stupid. You really will context-switch yourself to death that way, as every occasion where you need to coordinate between processes generates at least one IPC and every IPC generates at least one context switch (usually two or more)...and those are complete process/address-space switches, not relatively lightweight thread switches. That's how to build a really slow application.
No, it doesn't. There's simply no comparison between the speed of using the same memory - often the same cache on the same processor, if you do things right - and shipping stuff across the network...any network, and I was working on 1.6Gb/s full-duplex 2us app-to-app round-trip interconnects five years ago. Writing software to run efficiently on even the best-provisioned loosely-coupled system is even more difficult than writing a good multithreaded program. That's why people only bother for the most regularly decomposable problems with very high compute-to-communicate ratios.
Using separate processes instead of threads on a single machine might allow your other processes to stay alive if one dies, but your application will almost certainly be just as dead. The causal dependencies don't go away just because you're using processes instead of threads. In many ways having the entire application go down is better, because at least then it can be restarted. When I used to work in high availability, a hung node was considered much worse than a crash, and the same applies to indefinite waits when part of a complex application craps out.
Re:Non-threaded programs (Score:2)
Look at it this way: the best current transistors can switch at a rate of 200 GHz. That's a clock period of 5 picoseconds. These transistors will be in mass production within 20 years. (Possibly within a few years, but let's be pessimistic.) An electrical signal can travel only about 2 millimeters in that period of time. That means your arithmetic-logic unit cannot even fetch an operand from the L1 cache in a single clock cycle. These processors will use asynchronous message-passing logic at the very core, and farther out the system will be entirely based on messages. HyperTransport is the writing on the wall.
Also consider current-generation transactional systems. The request-producer doesn't have to block waiting for the first response: it can keep queueing requests. If the scheduler is good, this amounts to batching up a bunch of requests, processing them in one fell swoop, then sending the responses in one fell swoop. Of course whether this actually happens depends on the workload and the OS.
Incidentally, I'm typing this on a Unix machine that runs the graphics subsystem in a separate process. Every screen update requires at least one context switch. Does it suck? Not at all. The X11 protocol allows requests to be batched up and handled all at once. Whether the application draws a single character or renders a 3-D scene doesn't have much influence on the context switch overhead. Again, the appropriate solution depends on the workload, and the proper messaging abstraction can make separate processes quite practical. And if your compute jobs cannot possibly fit in a single machine, you have no choice but to use multiple processes.
Not at all. What about Monte Carlo simulations? Losing the occasional random process is irrelevant. What about artificial intelligences running on really big machines? Getting erroneous answers from subsystems, or not getting timely answers at all, will be a fact of a-life. Being able to terminate processes at will will be critical to building reliable AIs. What about graphics systems? Losing a frame or a part thereof is nothing compared to a full crash. What about speech recognition systems? A momentary interruption for one user is nothing compared to a total disruption of all users. Even in the present day, there are plenty of practical workloads that can withstand a subsystem dying, but would rather not see the whole system die hard. If the system is built on a foundation of multithreading, the only failure mode is a total crash.
Re:Non-threaded programs (Score:2)
No, you're looking toward the past. In the future, multi-CPU machines will become more common, not less, and learning to use them efficiently will also become more important. Within the box, multithreading will perform better than alternatives, even if there's message passing going on between boxes.
Yes, I'm somewhat familiar with transitions between message passing and shared memory. Remember that fast interconnect I mentioned, from five years ago? It was at Dolphin [dolphinics.com]. On the wire, it was message passing. Above that, it presented a shared-memory interface. Above that, I personally implemented DLPI/NDIS message-passing drivers because that's what some customers wanted.
The fact is that whatever's happening down below, at the programming-model level it's still more efficient to have multiple tasks coordinate by running in the same address space than by having them spend all their time packing and unpacking messages. The lower-level message-passing works precisely because it's very task-specific and carefully optimized to do particular things, but that all falls down when the messages have to be manipulated by the very same processors you were hoping to use for your real work.
...between nodes that are internally multi-processor.
Yes, yes, using parallelism to mask latency. Yawn. Irrelevant.
If context switches aren't all that bad, why were you bitching about context switching in multithreaded applications? Hm. The fact is, a context switch is less expensive than a context switch plus packing/unpacking plus address manipulation plus (often) a data copy. Your proposal is to use multiple processes instead of threads, even within one box. When are you going to start explaining how that will perform better, or even as well? When it won't "suck", to use your own charming phrase.
Please try to pay attention. I already referred to regularly decomposable applications with high compute-to-communicate ratios, and that's exactly what you're talking about. Yes, what you say is true for some applications, but does it work in general? No. As I said, I've worked in high availability. I've seen database app after database app, all based on message passing between nodes, lock up because one node froze but didn't crash. Everyone's familiar with applications hanging when the file server goes out, and that's not shared memory either. Message passing doesn't make causal dependencies go away.
Simply untrue. I've seen (and written) plenty of multithreaded applications that could survive an abnormal thread exit better than most IPC-based apps could survive an abnormal process exit.
Re:Non-threaded programs (Score:2)
As I also pointed out, the minimum time to send a round-trip signal from one CPU to another is determined by the speed of light. Suppose you have two processors that are 6 centimeters apart. The round-trip time for light is 200 picoseconds. Electrical signals are about half that fast, or 400 picoseconds. Therefore the act of merely acquiring an inter-thread lock will waste 80 clock cycles waiting for the atomic lock instruction to execute, assuming there is no contention. After that the thread will perform memory reads on shared data. Each read from a cold cache line will have an additional 80 wait states while the data is snooped from the hot cache. Complex activity can easily touch 50 cache lines, which is 4,000 clock cycles.
And that's the best-case scenario, where the programmer has flawlessly laid out the variables in memory to minimize cache transfers. In the real world it is appallingly easy to cause cache ping-pong, where two processes try to use the same cache line, and it keeps "bouncing" back and forth between multiple CPUs.
Oh, and that's assuming a zero-overhead protocol for coherency. Realistically you should expect a few hundred picoseconds or so of additional round-trip latency.
Finally, these numbers are for one node that contains a few CPUs. Going between nodes will be vastly worse. A four nanosecond cable (i.e., a one foot cable) between nodes means about 1000 wait states to acquire a lock. A two microsecond RTT (e.g., Myrinet) means 400,000 wait states.
A shared memory space implies a coherency mechanism. A coherency mechanism implies round-trip messages ("I need this, can I have it?"). The thing is, I/O latency isn't improvable. You can only put chips so close together, and the speed of light is sort of non-negotiable. CPU speed, however, is improvable. So the ratio of I/O latency to clock period is going to keep increasing. Code that doesn't batch up data will not run any faster on the machines of tomorrow.
I wasn't bitching about it. I was correcting the misstatement that its influence on performance was zero. I was pointing out that all applications that want maximum performance will be designed that way. They will do whatever it takes to deserialize the algorithms. If several CPUs need to know the results of a simple calculation, they will each calculate it themselves, because calculating it in one place and distributing the results would take a thousand times longer. Take a look at what the Linux kernel is doing sometime. Everything is moving to per-processor implementations. Each processor gets its own memory allocator, thread/process queue, interrupt handlers, and so forth. Inter-CPU locks and shared memory are avoided like the plague. They know that the I/O-to-core clock ratio is bad, and it's going to get much, much worse.
Re:Non-threaded programs (Score:2)
Gee, maybe all that's why people are going to things like multiple cores on one die, hyperthreading, etc. Programming-wise, these present the same interface as multiple physical CPUs, but they also ameliorate many of the problems you mention...speaking of which, everything you're presenting as a "killer" for multithreading is even worse for your multi-process model.
As a former Dolphin employee, I have to point out that Myrinet was never that fast.
You remember that little thing about using parallelism to mask latency? Most serious programs outside of the scientific community are I/O bound anyway and the whole point of multithreading is to increase parallelism.
Getting rather far afield here, aren't you?
You're forgetting that an operation's effect on performance is a product of its cost and frequency. Go read H&P; it'll spell this out for you much better than I can.
What about those that can't? That's a lot of applications, including important ones like databases which must preserve operation ordering across however many threads/processes/nodes are in use. You can't just point to some exceptional cases that are amenable to a particular approach, and then wave your hands about the others. Well, you can - you just did - but it doesn't convince anyone.
Since you're practically an AC I can't be sure, but odds are pretty good that I know more about what the Linux kernel is doing and excellent that I know more about what kernels in general are doing. Your appeal to (anonymous) authority won't get you anywhere.
The real point that we started with is your claim that any multithreaded application will "suck". That statement only has meaning relative to other approaches that accomplish the same goals. Are you ever going to get around to backing that up, or will you just keep going around and around the issue in ever-widening circles hoping I'll get nauseous and quit?
Re:Non-threaded programs (Score:2)
I wonder if databases will be that hard, though. Good ones already provide on-line replication and failover, multi-version concurrency, and transactions that automatically roll-back if a collision is detected.
Did I say that all programs will be easily adaptable to that approach? No. I said that those programs that do not adapt will tend to have poor performance. Ah, I state an inarguable and easily-verified fact, you talk about how much more you know. Smooth move, Ex-Lax. Look, Mr. Smarty Pants, I don't have the time to write a frickin' tutorial on things that are common knowledge, where the reader can find enlightenment at their nearest search engine nearly as fast as I can write an essay on the topic. But since you can't be arsed to do it, here's a link to search Google for "Linux per-cpu" [google.com]. See? Tens of thousands of hits. If you add "allocator" to the search criteria, this [lwn.net] is the first hit. It assumes background knowledge, but should make sense.
Ladies and gentlemen, we have just lost cabin pressure. The statement was "If you have a few million tasks to do, pretty much any threading system is going to suck."
The context makes it clear that "tasks" means "things that make the threads talk to each other". And it will, indeed, truly and royally suck.
Re:Non-threaded programs (Score:2)
...and anything else - most especially the approach you espouse - will truly and royally suck more. Life just sucks sometimes; too bad, you lose.
Re:Non-threaded programs (Score:2)
It isn't zero, though. If you have a few million tasks to do, pretty much any threading system is going to suck.
Re:Non-threaded programs (Score:2)
No, it's not zero but (as I said; *sigh*) it's a small price to pay to get multiprocessor scalability.
On the contrary, a threading system is highly likely to be the only way you'll meet your cost/performance goals, because a well-written multithreaded server running on a 4-way server will get you nearly 4 CPUs worth of performance. One written your way will require a rack full of machines and fast switches plus some custom state-management code, and might still thrash itself to death no matter how much hardware you throw at it. Please, educate yourself. Do some reading, run some tests, get some experience...then come back and we can talk about which approaches can handle millions of requests per minute.
BTW, thanks for visiting my site. Where's yours?
Re:Non-threaded programs (Score:2)
Re:Non-threaded programs (Score:2)
But even aside from that, a lot of widely-used C APIs are not reentrant, and can't be made so because of the use of static data, which makes it hard to add support via libraries. gethostbyname() is particularly annoying (for some reason, up until recently, gethostbyname_r() wasn't even in the man pages on my Linux box, and the parameters to the damn thing vary between Solaris and Linux...). errno doesn't lend itself well to multithreading. strtok() isn't reentrant either...
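For what it's worth, here's the reentrancy difference in miniature. The input string is made up; the point is that strtok() keeps its cursor in hidden static data while strtok_r() takes it from the caller, and gethostbyname_r()/strerror_r() follow the same "_r" pattern (with the platform-dependent argument lists complained about above):

/* strtok() stores its position in static state, so two threads (or even two
 * nested loops) tokenizing at once corrupt each other. strtok_r() moves that
 * state into a caller-supplied pointer, so each caller is independent. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[] = "eth0 192.168.0.1 up";
    char *saveptr;                        /* per-caller cursor, not static */
    for (char *tok = strtok_r(line, " ", &saveptr);
         tok != NULL;
         tok = strtok_r(NULL, " ", &saveptr))
        printf("token: %s\n", tok);
    return 0;
}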
And while I'm talking about C/C++ issues with threads, I'm not much of an SML fan, but the speed of spawning and destroying threads in CML (the concurrent extensions) makes the C-based pthreads look pathetic.
Re:Non-threaded programs (Score:3, Insightful)
This is partly a matter of taste, but I dislike languages that are excessively large. That is, when given the choice between implementing a feature in the language itself or in the standard libraries (which are built in, or at least interfaced via, the same language) you should try to use the language you already have.
Academics prefer this because it follows principles like Occam's Razor and MDL (minimum description length, an artificial-intelligence-related term for program quality).
This simplifies your language definition, but transfers some complexity to your library documentation- which is optional reading for learning the language. And it makes the language more extensible in the future. The classic example that C++ advocates pick on is Java's String class. Two Java Strings support the "+" operator to concatenate them as a special language feature. But 3rd party library developers cannot support "+" with their own Objects, like complex numbers or string-like series of non-character data.
The same argument can be applied (with much more complexity and opportunity for disagreement or plain old error) to the question of including threading support native in the language, rather than as an external library. Language supporters may say "The language natively provides the CPU's logical, arithmetic, and memory management operations. Threads are just as fundamental, and should go there too". The library guys respond "No useful program lacks logic, arithmetic, and memory. But we've gotten by fine for decades without threads. They're OPTIONAL. And not all OSes support threads - you want to make them incompatible with your language then?"
It goes back and forth, but winds up with a pro-Library argument backed up by programming language theory- language support for threads offers no more expressive power than library support, so they should be kept in the standard libraries. So C/C++ adopted this approach (or rather, C++ kept the C approach as it had been justified).
It sounds great to theorists, who think that even C++'s 4 styles of parentheses are redundant and excessive compared to what's used in Lisp. But outside of conceptual language design, there's a large practical problem which has retarded the performance of C++ programs to this day: backwards compatibility. Specifically, compatibility of new source code with old linkers. (This problem applied somewhat to the acceptance of other compiled languages besides C++)
To get any acceptance, new versions of C++ needed to be compatible with users' existing C libraries. And to reduce the workload of C++ compiler developers, they made C++ compilers fit into the C workflow (compile, compile... & link) as directly as possible.
But that undermines one big assumption of the "provide important features IN the language, not AS the language" crowd - the assumption that the compiler is very, very good. Their quality metric ignored the ease of the compiler making good binary code from your source - as long as the language has the ability to express their intent compactly and unambiguously they're happy - but that intent may not be clear if the compiler isn't looking at the whole program.
A compiler can make global optimizations if it considers the whole program at once, avoiding the function-call overhead of using external functions for core features. But the C development process - only giving the compiler small sections of code at once, and then depending on a separate program to link them together - means that the compiler simply can't make the best choices, burdened with incomplete information. (Today, we sometimes have smarter linkers which support more function inlining and const propagation, but they're a poorer solution than using compilers all the way through.)
So, this lack of super-good compilers is why pulling more features into a language definition has been helpful, even though plenty of CS graduate theses say it shouldn't be so.
I don't think C/C++ is a good language either- except to implement other languages in! (Where "other languages" may include all the graphics, networking, compression, and other low-level code that Ada95 programs access via C bindings). And to make the academics most happy, that language should be Lisp or ML, which can then be used to write any other compiler/interpreter you might wish.
There are other valid reasons why C++ is still heavily used, but they're mostly shortsighted and based in legacy compatibility ("We've always written desktop applications that way!")
Re:Non-threaded programs (Score:5, Informative)
As to further licensees of the engine, revamping the engine to use multithreading was probably not a very high priority in making a game.
On the other hand, for someone writing an engine from scratch it's a different matter.
Re:Non-threaded programs (Score:1)
The problem wasn't fixed in Half-Life...
Heh.. I just wanted to point out that they probably need to port it to linux before they can take advantage of the elite linux multithreading.
Re:Non-threaded programs (Score:3, Informative)
Troll! ;)
---
(And yes, Mike Abrash did WinQuake, not Carmack)
Re:Non-threaded programs (Score:4, Insightful)
Multi-threading is an easy way to cut down response latency in programs and produce a responsive UI. Unfortunately, it also has many drawbacks -- it can actually be slower (due to having to maintain a bunch of locks...you're usually only better off with threads if you have a very few), and it's one of the very best ways to introduce very hard to debug bugs.
I do think that a lot of GTK programmers, at least, block the UI when they'd be better off creating a single thread to handle UI issues and hand this data off to the core program. Also, when doing I/O that doesn't affect the rest of the program heavily, it can be more straightforward to use threads -- if you have multiple TCP connections running, it can be worthwhile to use a thread for each.
There are a not insignificant number of libraries that are non-reentrant, and have issues with threads. Xlib, gtk+1 (dunno about 2), etc.
Threading is just a paradigm. Just about anything you can manage to pull off with threading you can pull off without threading. The question is just which is cleaner in your case -- worrying about the interactions of multiple threads, or having more complex event handlers in a single-threaded program.
The other problem is that UNIX has a good fork() multi-process model, so a lot of times when a Windows programmer would have to use threads, a UNIX programmer can get away with fork().
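Roughly, the two models look like this side by side (a sketch, assuming a POSIX system; the work function is hypothetical). fork() gives the child its own copy-on-write memory, pthread_create() shares one address space, which is what matters when the tasks must pass large amounts of data back and forth:

/* Two ways to run an independent task concurrently. The forked child cannot
 * stomp on the parent's memory; the thread can, which is both its advantage
 * (cheap data sharing) and its danger (needs reentrant code and locking). */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void do_work(const char *who) { printf("%s working\n", who); }

static void *thread_main(void *arg) { do_work("thread"); return NULL; }

int main(void)
{
    pid_t pid = fork();                 /* process model: isolated memory */
    if (pid == 0) { do_work("child process"); _exit(0); }
    waitpid(pid, NULL, 0);

    pthread_t t;                        /* thread model: shared memory */
    pthread_create(&t, NULL, thread_main, NULL);
    pthread_join(t, NULL);
    return 0;
}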
So you only really want to use threads when:
* you have a number of tasks, each of which operates mostly independently
* when these tasks *do* need to affect each other, they do so with *large* amounts of data (so the traditional process model doesn't get as good performance).
* You have more CPU-bound "tasks" than CPUs, so you derive a benefit from avoiding context switching that characterizes the fork() model.
* you are using reentrant libraries in everything that the threads must use.
Re:Non-threaded programs (Score:2)
Actually, I believe the point is that Linux will have excellent thread support. In other words when 2.6/3.0 is stable enough for production use (2008? :) )
Re:Non-threaded programs (Score:3, Interesting)
What a load of crap. There are plenty of threaded applications for Linux. The problem is that all these inexperienced "threads-every-fucking-where" programmers that Java spawned fail to understand that threading is NOT the solution for everything.
Besides, in Unix-style coding few things are as common as forking about, which in many cases is what people also do with Java all the time. Real single-memory-space cloned processes (i.e. threads) have fewer uses than people actually think.
The worst example of this was the Quake I source code, which was used for many games, including Half-Life. The code was not multi-threaded, and the network code sat idle while everything else drew -- adding about 20ms of lag, unless you cut the frame rate down to about 15 or so.
If you'd EVER actually used threads in Linux, you'd know that if there are busy threads you would still get to run at most once every 20 ms, and quite likely far less often.
It's easy to try out, even. Write code that performs usleep(1) or sched_yield() every now and then and checks how long that takes. Especially try out the case by putting a few totally separate processes in the background doing while(1); loops. There's your 20 ms and way more...
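Something along these lines will show it (a rough sketch; it just measures the worst case over a thousand iterations, so start a few while(1) processes in the background first to see the effect described above):

/* How long does a "1 microsecond" sleep really take under load?
 * Build: gcc -O2 yield_lag.c */
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static long long usec_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

int main(void)
{
    long long worst = 0;
    for (int i = 0; i < 1000; i++) {
        long long t0 = usec_now();
        usleep(1);                        /* ask for one microsecond...    */
        long long dt = usec_now() - t0;   /* ...and see what we were given */
        if (dt > worst)
            worst = dt;
    }
    printf("worst usleep(1) latency: %lld usec\n", worst);
    return 0;
}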
When Quake 1 was written the 20 ms lag was considered NOTHING. At that time online gaming was limited to mainly xpilot and MUDs. It started the boom, and naturally the demands changed, too. Thus id wrote the QuakeWorld client, which was quite different.
Besides, the brutal fact is that a single-threaded process can always be made faster than _ANY_ multithreaded approach, although it's often quite difficult. Moreover, threading is never chosen as an approach due to performance, but rather because it simplifies the structure in some cases.
Given the amount of optimization already present in Quake 1, I feel quite safe saying that lack of threads in Dos had jack to do with Quake 1 being single threaded.
Re:Non-threaded programs (Score:2)
approach, although it's often quite difficult. Moreover, threading is never chosen as an approach due to performance, but rather because it simplifies the structure in some cases.
Apparently you never use multi-CPU computers?
In any case, for some tasks, raw speed isn't as important as low latency. By using multiple threads with a good scheduler and a well-thought-out priority system, you can end up with a very responsive program, something which would be much harder to do with a single thread. See BeOS's GUI for a good example.
Re:Non-threaded programs (Score:2)
Re:Non-threaded programs (Score:2)
Re:Non-threaded programs (Score:2)
Where do you *live*?
Half-Life has been the foundation for the most popular FPS (unless UT took it away for a bit) for years. The original Half-Life was perhaps the first popular FPS to have a decent story (ignoring, of course, the elderly and much-revered Marathon [bungie.org] (I challenge *anyone* to find a game that has developed a fan base as fanatical as the one that has developed the site at the other end of this link -- look at the "Facts and Puzzling things about..." sections)). Half-Life vastly changed the face of the FPS industry, and brought in much more scripting and plot work to FPSes, leading to impressive newer games like Max Payne [3drealms.com].
Half-Life spawned two short sequels from Valve (Opposing Force and Blue Shift). It is the game that the phenomenally popular Counterstrike mod was made for.
Half-Life is also notable for its very quick software renderer, its introduction of the newer "multiple weapons per slot" weapon inventory system, and popularizing manually-triggerable reloading.
Just hope they think about users first... (Score:1)
I thought that was what OSS was about, getting the best of all worlds?
--
Regards,
Helmers
I don't see why the two are mutually exclusive. (Score:5, Insightful)
But, again, I might want to write a server of some sort which handles hundreds of thousands of connections at once, but 99% are idle at any given time and the other 1% require some nontrivial processing sometimes and/or a long stream of data to be sent without prejudicing the other 99%. Now, for ANY 1:1 threading system, I can't just create x * 10^5 threads because the overhead would be colossal. But equally so, implementing this with poll() is going to be horrid, and if the amount of processing done on a connection is nontrivial and/or DoS'able, there's going to be tons of hairy context management code in there, until lo and behold you end up with a 1:N or M:N scheduling implementation yourself. NGPT could be very useful as a portable userspace library here, as these people have implemented an efficient M:N scheduler under GPL, something that hasn't existed before and could be very useful. I think these libraries might be much more complementary than the article makes out.
Re:I don't see why the two are mutually exclusive. (Score:2, Informative)
Re:I don't see why the two are mutually exclusive. (Score:1)
Re:I don't see why the two are mutually exclusive. (Score:1)
Re:I don't see why the two are mutually exclusive. (Score:3, Interesting)
Because there's no need. Since AIO functions on the concept of callbacks, your callback will be called when the operation completes. Completion may be "errored-out" or it may be "completed". Adding the house-keeping for these is a no-brainer should you really need to have them. After all, you already have to track AIO context to some degree (buffers and perhaps state). Keeping track of your desired information is trivial at this point.
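A rough sketch of that callback style using the POSIX AIO interface (aio_read() with SIGEV_THREAD notification, link with -lrt). The file name and the crude sleep()-based wait are placeholders; a real program would hang its own context off sigev_value if it needed more state than the aiocb itself:

/* The request carries its own context, so completion vs. error tracking is
 * just aio_error()/aio_return() when the callback fires. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char buf[4096];
static struct aiocb cb;

static void on_done(union sigval sv)
{
    struct aiocb *req = sv.sival_ptr;
    if (aio_error(req) == 0)
        printf("read %zd bytes\n", aio_return(req));
    else
        perror("aio read failed");
}

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* placeholder file */
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb.aio_sigevent.sigev_notify_function = on_done;
    cb.aio_sigevent.sigev_value.sival_ptr = &cb;

    aio_read(&cb);
    sleep(1);            /* crude: give the callback a chance to run */
    close(fd);
    return 0;
}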
Re:I don't see why the two are mutually exclusive. (Score:5, Informative)
Actually, it's kind of famous for that [slashdot.org].
Re:I don't see why the two are mutually exclusive. (Score:3, Informative)
If you read the article, it shows benchmarks done by the NPTL folks which shows a 2x improvement in thread start/stop timings over NGPT (which itself is a 2x improvement over POLT (plain old Linux threads)).
Read more about NPTL here [redhat.com] (PDF file).
Re:I don't see why the two are mutually exclusive. (Score:2)
Why is the context management code going to be so hairy that you end up with something as complex as a 1:N or M:N scheduler?
Re:Why poll? or why M:N? (Score:2, Insightful)
For systems that don't require true high-end scalability, 1:1 works fairly well. It's because of this that M:N has some proponents.
Re:Why poll? or why M:N? (Score:2)
So your POINT is actually incorrect. The correct statement is that the 2.6 kernel when using a 1:1 model will allow for greater scalability than its previous implementation, but has yet to prove it addresses all workloads better than all other threading models. In fact, it's expected that M:N will still have its place for some types of workloads even after 2.6 is released.
I personally expect that 1:1 and M:N will both become more common and that there is a time and place for both.
Re:Why poll? or why M:N? (Score:2)
IIRC, BSD has had an O(1) scheduler for a very long time now. Many RT OSes have also had them. This is evolutionary for Linux and is not a "real advance in scheduling tech". It is, however, a real advance on an evolutionary basis, for Linux.
So are they both useful? (Score:5, Interesting)
I can easily imagine that one of them might be more efficient for gigantic numbers of threads that don't individually do much, or maybe one might be more efficient for very large numbers of processors, but I don't know jack about the issues involved, so I'm just talking out my ass. (Hello! I'd like to "ass" you a few questions!)
So, someone who knows... Are these threading systems good for different things? And would it really be that hard to make them both come with the kernel?
Re:So are they both useful? (Score:5, Insightful)
They both implement the POSIX threading API (a good thing IMHO). NPTL is more radical; the IBM team made a conscious decision to keep the impact of their changes to the minimum. For that reason, I expect that NGPT will be accepted; it has a shorter path to deployment in production systems, even though NPTL is a more "correct" solution (i.e. it uses purely kernel threads). But it changes userspace, libc and the kernel - it will be much harder to verify.
Are these threading systems good for different things? And would it really be that hard to make them both come with the kernel?
Developers shouldn't care, or more accurately it doesn't matter for them. Both implement POSIX threads, so it simply depends what is installed on the system on which their code ends up running - the same application code will work the same on both, altho' each will have its "quirks". Sysadmins will prefer the NGPT because it is easier to deploy and test. Linux purists will prefer NPTL because a) it's the "right" way to do it, and b) it was written by Red Hat.
They could both come with the kernel source and you could choose one when you compiled it. I don't see how they could coexist on a single system.
Re:So are they both useful? (Score:3, Interesting)
It seems to me that the NPTL will smoke the NGPT. The author of the article is just being diplomatic. Keep in mind that Ulrich is/was a key developer on both. Usually when a good engineer changes his approach to solving a problem, it's because he has found a better solution.
Re:So are they both useful? (Score:2, Informative)
Yes and No (Score:5, Informative)
By sticking with the 1:1 solution that's currently used in the kernel and the NPTL model, there's really only the kernel scheduler to worry about, making things run a lot more smoothly generally. I'd imagine latency being a big issue with M:N (I'm pretty sure that it was mentioned in the whitepaper). I haven't read the other side of the issue, but I think that pretty graph in the O'Reilly article says it all performance-wise.
There are other issues though, like getting full POSIX compliance with signal handling. The 1:1 model apparently makes signal handling much more difficult (I don't know anything about the POSIX signaling model, but there's a paper about it on Drepper's homepage [redhat.com] that could probably shed some light on the subject if you were so inclined). There are other issues in the current thread model that have to be dealt with in a new 1:1 model (and are), such as a messy
From the whitepaper, it seems that the development of the O(1) scheduler was meant to facilitate the new thread model they've developed, which I hadn't thought about before even though it makes sense. There are still some issues to work through, but both models look promising. If the signal handling issues can be resolved, it looks from the article like NPTL's model will win on sheer performance.
As for making them both come with the kernel, that's really really difficult, since this stuff touches on some major pieces of the kernel like signal handling. The same way you're only going to get one scheduler and VM subsystem, you're only going to get one threading model. You're able to patch your own tree to your heart's content, but as per a default install, there can be only one.
Re:Yes and No (Score:2)
I'd personally like to see the NPTL kernel mods with the NGPT libraries. This would seem to provide the most forward-looking approach, as it offers lots of scalability and flexibility.
I personally can't wait for the 2.6 kernel, whichever model wins. Java apps are nice, but they tend to use way too many threads. I really don't know why a select() wasn't present from the beginning in a language designed to be used in a networking environment. Oh well.
Re:Yes and No (Score:2)
So you get the best of both worlds: the common case is handled with no overhead--not even a single system call--while scheduling is handled by an omniscient 1:1 kernel scheduler.
Meta comment: Why the fuck does Slash strip out HTML non-breaking space character entities? More proof that the /. operators don't care about content, they just want to milk the site for advertising money.
LWN (Score:5, Informative)
Someone should start a site.... (Score:1, Redundant)
I was aware of the debate about the linux threading issue, but the kernel mailing list was too noisy to pick out this kind of detail.
Someone should start a site that covers long term issues, rather than the week by week stuff I've found on the web... or maybe someone has, and I'm just too out of the loop....
Re:Someone should start a site.... (Score:4, Informative)
KernelTrap [kerneltrap.org].
Re:Someone should start a site.... (Score:4, Informative)
Oh crap, I wish I didn't have to say this... (Score:1, Flamebait)
I guess since Windows sucks so bad at shitloads of processes, when programming on 4 or more CPUs you really quickly have to learn how to write multithreaded code that works, and works correctly. You poor Unix guys are struggling through something we all went through years ago -- learning how to think correctly in something more sophisticated than a single thread of control.
CPU-bound tasks (spinning in a loop calculating pi) make it easy to saturate all the resources using any model. How come y'all are switching to a thread-based model now? Was the other way running out of steam?
Honestly curious...
Re:Oh crap, I wish I didn't have to say this... (Score:5, Interesting)
Solaris is to blame for all this (Score:2)
1:1 is a cleaner, simpler model.
Re:Oh crap, I wish I didn't have to say this... (Score:2, Informative)
Multics had multithreading in the early 1970s.
Windows was still launched from DOS in 1992.
Please go back to your "innovating" with Windows.
-Kevin
Re:Oh crap, I wish I didn't have to say this... (Score:2)
Re:Oh crap, I wish I didn't have to say this... (Score:2)
Yes, threads can be misused and abused, but for some problems they're the right tool. All the developers in this article are trying to do is develop a hammer that won't break the first time they hit a nail with it. (IMHO, Linux's inability to deal with massive numbers of threads without severe performance degradation makes the threading implementation approach uselessness. I'm glad someone is fixing it.)
Re:Oh crap, I wish I didn't have to say this... (Score:3, Insightful)
Correctly programming threads is hard, so they should only be used when necessary. Many of the things that can be done with threads can be done more safely with fork() and/or select(). Since Windows lacks the former and has a broken version of the latter, Windows programmers tend to use threads when Unix programmers would use an alternative.
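For illustration, the select() alternative in miniature - a single thread watching many descriptors, no locks anywhere. The descriptors are assumed to be already-connected sockets, and the echo is just a stand-in for real request handling:

/* Single-threaded I/O multiplexing: service whichever connection is ready. */
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

void serve(int *fds, int nfds)
{
    for (;;) {
        fd_set readable;
        int maxfd = -1;
        FD_ZERO(&readable);
        for (int i = 0; i < nfds; i++) {
            FD_SET(fds[i], &readable);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
            break;                               /* interrupted or error */
        for (int i = 0; i < nfds; i++) {
            if (FD_ISSET(fds[i], &readable)) {
                char req[512];
                ssize_t n = read(fds[i], req, sizeof req);
                if (n > 0)
                    write(fds[i], req, n);       /* echo as a placeholder */
            }
        }
    }
}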
Re:Oh crap, I wish I didn't have to say this... (Score:3, Insightful)
What the heck does altering the structure of a thread *library* have to do with application-level thread programming? What are you talking about?
Re:Oh crap, I wish I didn't have to say this... (Score:2)
It's far more instructive for those of us a little lower on the learning curve to observe these architectural questions in public, than to have them made for us by Those Who Know Best.
Prediction: both implementations have their appropriate place. In the ideal world, you'd install the proper one based on the application.
The one that becomes the usual suspect in commercial distributions is likely the one that is best suited for business tasks, which aren't that thread-happy in the first place, if my guess isn't too far off...
How about scheduling & thread-specific storage (Score:5, Interesting)
* scheduler does not immediately respond to priority changes
* thread-specific storage access is slow
There is a well-known effect in multi-threaded programming called priority inversion that can cause deadlocks when a low-priority thread has acquired a resource that a high-priority thread is waiting for, but a medium-priority thread keeps the low-priority thread from being executed, and so the medium-priority thread effectively gets more cycles than the high-priority thread.
One way to overcome this problem is to use priority ceiling locks where the priority of a thread is boosted to a ceiling value when it acquires a lock. Unfortunately I found that changing the priority of a thread for a short interval does not have any effect at all with the current 2.4.x standard pthreads implementation.
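For reference, POSIX exposes priority ceilings directly through mutex attributes; whether the pthreads implementation and kernel of the day actually honor this promptly is exactly the problem described here. A sketch, assuming PTHREAD_PRIO_PROTECT support and permission to use real-time scheduling:

/* A priority-ceiling mutex: whoever holds it runs at the ceiling priority,
 * so a medium-priority thread can no longer starve the lock holder. */
#include <pthread.h>

void init_ceiling_lock(pthread_mutex_t *m, int ceiling_prio)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&attr, ceiling_prio);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}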
The second problem I encountered is that accessing thread-specific storage with pthread_getspecific() takes 100-200 processor cycles on a 1 GHz PIII, which makes this common approach to overcoming hotspots almost as slow as locking.
Does anyone know if any of these issues are addressed by the new implementations?
p.
Re:How about scheduling & thread-specific stor (Score:3, Informative)
Regarding thread specific data access: If your LinuxThreads library uses floating stacks (for ix86 this means it has been built with --enable-kernel=2.4 and for i686) it already will be faster.
For other TLS enhancements take a look at http://people.redhat.com/drepper/tls.pdf [redhat.com].
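As a rough illustration of what that paper is about, here are the two styles of per-thread data side by side; the counter is hypothetical, and __thread assumes a toolchain and libc with ELF TLS support (part of what the new implementations bring):

/* Style 1 goes through pthread_getspecific(), the call measured at 100-200
 * cycles above; style 2 compiles down to a few ordinary memory accesses on
 * platforms with ELF TLS. */
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_init(void) { pthread_key_create(&counter_key, free); }

static long *counter_dynamic(void)          /* style 1: dynamic key */
{
    pthread_once(&counter_once, counter_key_init);
    long *c = pthread_getspecific(counter_key);
    if (c == NULL) {
        c = calloc(1, sizeof *c);
        pthread_setspecific(counter_key, c);
    }
    return c;
}

static __thread long counter_static;        /* style 2: compiler-supported TLS */

void bump_both(void)
{
    (*counter_dynamic())++;   /* library call on every access */
    counter_static++;         /* plain load/store into the thread's TLS block */
}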
Re:How about scheduling & thread-specific stor (Score:3)
Regarding changing of priorities, I think that with SCHED_OTHER the priority is being automatically modified by the scheduler to distribute cycles in a more fair fashion.
I tried both SCHED_RR and SCHED_FIFO and changing priorities basically works, but it seemed to me that changing priorities did not have an immediate effect as required to implement priority ceiling locks.
For example, when boosting the priority of a thread to the ceiling priority, and the thread is the only one with this priority, I expect it to run without being preempted by anyone before the priority is lowered or the process blocks. On the other hand, when lowering the priority, I expect a higher-prio thread to be executed immediately. I would also expect the order of unblocking threads to be correctly adjusted when their priority was changed while suspended.
However, it seems that priority changes do not much affect the actual timeslice or the unblocking order, but I did not have the means to find out what exactly happens; using a debugger is outright impossible with fine-grained multi-threaded programs.
Is it possible that some system thread needs to run in between to do some housekeeping? Do you have any hints about the scheduler's inner workings?
Thank you
p.
Kernel vs user doesn't make sense (Score:4, Interesting)
Mode switching times. (Score:2, Informative)
Re:Mode switching times. (Score:2, Insightful)
Re:Kernel vs user doesn't make sense (Score:3, Informative)
Whenever execution switches between user mode and kernel mode, a context switch is required. Context switches are expensive.
Incidentally, this is one of the advantages of the microkernel approach: by severely limiting the code that must be run in kernel space, you can minimize context switches between kernel and user mode and save a lot of time.
Re:Kernel vs user doesn't make sense (Score:3, Informative)
The context switches are increased because a single operation (say, an I/O read) requires switching into the kernel from the user process, and then out into a device driver. A non-microkernel would have the device driver in the kernel. This is just an example - it may be that the switch is to the file system manager instead, or some other helper process. The point is that the nature of a microkernel is to have lots of helper processes that perform what are normally macro-kernel functions.
Context switches typically are expensive because they involve more than just a switch into kernel mode. They are likely to involve some effort to see if there is other work to do (such as preempting this thread). They may involve some privilege checks, and some statistics gathering.
A microkernel just does less of this stuff.
BTW... the first elegant running micro-kernel I ran into was the original Tandem operating system. The kernel was primarily a messaging system and scheduler (I think scheduling *policy* may have been handled by a task, btw). I/O, file system activity, etc was handled by privileged tasks. It was very elegant, and conveniently fit into their "Non-Stop (TM)" operation.
and not a moment too soon (Score:4, Interesting)
I bet the 1:1 package would have finer-grained context switching, though. M:N models tend to switch thread contexts only during I/O or blocking system calls. With finer-grained thread switching you tend to expose more bugs in multithreaded code, which is a very good thing. But I suppose even in an M:N model you could always set M=N to achieve similar results.
Linux processes are not LWP's (Score:2, Interesting)
This is not true. Every process in Linux is a full-weight process. The fact that some of those processes may map to the same memory space does not make them in any way lighter to the other parts of the kernel. What Linux does have is a lightweight clone() system call.
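A sketch of what that looks like from user space, assuming glibc's clone() wrapper; the flag set below is roughly what a 1:1 thread library passes, which is why such a "thread" shows up as just another process in ps:

/* A clone()d child is a full process that happens to share the parent's
 * address space, files, and signal handlers.
 * Build: gcc clone_demo.c  (needs _GNU_SOURCE for the clone() prototype) */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_counter;          /* visible to the child via CLONE_VM */

static int child_fn(void *arg)
{
    shared_counter = 42;
    return 0;
}

int main(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);

    /* The stack grows down on most architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + stack_size,
                      CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                      NULL);
    waitpid(pid, NULL, 0);          /* it is still wait()able like any process */
    printf("counter set by child: %d\n", shared_counter);
    free(stack);
    return 0;
}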
Solaris is an excellent example of the differences between processes, lightweight processes (lwp's or kernel threads), and user threads.
Linux will prevail (Score:3, Insightful)
I am not well-versed in the world of Linux (I have my own allegiances [mammals.org]) but am being drawn to it more and more. Reading the article, it felt very clear to me that Linux will prevail (with a nod to William Faulkner's Nobel speech [nobel.se]).
Consider a few quotes from the article:
Perhaps others have already pointed this out [tuxedo.org], but I am newly impressed with the universal nature of Linux. The power of an operating system that *everyone* is interested in improving, and has the opportunity to improve, is awesome. Yes, Microsoft has tremendous resources, and very earnest, good-willed, brilliant people. But to improve Microsoft's kernels, you have to work for Microsoft. That means switching the kid's schools, moving to Redmond, etc. etc. On the other hand, everyone from IBM to HP to some kid in, say, Finland, can add a good idea to Linux. When the kernel's threads implementation is a topic for conversation at conferences, with multiple independent teams coming up with their best ideas, Linux is sure to win in the long run.
I'm struck by the parallels to my own field of scientific research: Yes, the large multinational companies have made tremendous contributions in materials science, semiconductors, and biotech. They work on the "closed-source", or perhaps "BSD", model of development. But it is the "GPL"-like process of peer-reviewed, openly shared, and collaborative academic science that has truly prevailed.
I thought it was 3.0? (Score:2)
Re:I thought it was 3.0? (Score:3, Informative)
In one of the discussions with Linus on this issue he said there was a planned change that broke something, but it wouldn't be in for this version. Because that would warrant a major version change of its own, he didn't want to go from 2.5 to 3.0 and then from 3.3 or something to 4.0; he'd rather go from 2.9 (or so) to 3.0, and avoid the version inflation.
I agree. There's no stigma in having a product numbered 1.x or 2.x, it simply means you got it right early on, without needing to break old applications too often.
Re:I thought it was 3.0? (Score:2)
FWIW, it's looking to be a hell of a kernel.
Compare to Solaris evolution (Score:5, Insightful)
The change in thinking for this is argued in this Sun Whitepaper [sun.com], and this FAQ [sun.com].
If one believes the Sun guys have a clue, you can take this as a vote in favor of 1:1.
IMO, anyone who runs more than about 4*NCPUS threads in a program is an idiot; the benchmarks on 10^5 threads are absurd and irrelevant.
Once you run a reasonable number of threads, you can be quickly driven to internal queueing of work from thread to thread; and by the time you have done that, you may already have reached a point of state abstraction that lets you run event driven in a very small number of threads, approaching NCPUs as the lower useful limit. Putting all your state in per-thread storage or on the thread stack is a sign of weak state abstraction.
-dB
Re:Compare to Solaris evolution (Score:3, Insightful)
Typical *NIX developer. Threads are useful for two things:
1) Keeping CPUs busy. This is where the whole NCPU business comes from.
2) Keeping the program responsive. *NIX developers, with their fear of user-interactive applications, seem to ignore this point. If an external event (be it a mouse click or network connection) needs the attention of the program, the program should respond *immediately* to that request. Now, you can achieve this either by breaking up your compute thread into pieces, checking for pending requests after a specific amount of time, or you can just let the OS handle it. The OS is going to be interrupting your program every 1-10 ms anyway (timer interrupt), and with a good scheduler it's trivial for it to check to see if another thread has become ready to run. The second model is far cleaner than the first. A thread becomes a single process that does a single specific task. No internal queueing of work is necessary, and threads split up according to logical abstractions (different tasks that need to be done) instead of physical ones (different CPUs that need to be kept busy).
Re:Compare to Solaris evolution (Score:3, Insightful)
If your application design calls for 100 concurrently operating threads, there is something broken about the decomposition.
-dB
Re:Compare to Solaris evolution (Score:2)
-dB
Re:Compare to Solaris evolution (Score:2)
I'd look at it a different way, and count likely-to-be-active threads. If that number's much greater than NCPUS you're probably hurting yourself, but threads that are certain or almost certain to be sleeping (e.g. periodic tasks that are between invocations) don't really count. I discuss this issue more in an article on my website [pl.atyp.us] about avoiding context switches (part of a larger article about server design) if you're interested.
Re:Compare to Solaris evolution (Score:2)
-dB
Re:Compare to Solaris evolution (Score:2)
The issue is thinking through the scalability implications of using lots of threads. What the application ends up doing is relying on the thread scheduler to make proper policy decisions. If the application maintains its own work queue(s) internally, it is in a far better position to make correct decisions. Yes, it's more work, but it works better.
-dB
Re:Compare to Solaris evolution (Score:2)
Neither. ;-) Listener/worker separation is a fundamentally flawed model; try symmetric [pl.atyp.us] multithreading [pl.atyp.us] instead.
how to tell if processes are threads? (Score:2)
Haven't we gone through this before? (Score:3, Insightful)
While it certainly is useful that IBM is using its experience with a MxN threading model to improve Linux, this is not some new, ground-breaking concept. The other UNIX operating systems made this move anywhere from 5-10 years ago.
My guess is the Redhat 1x1 model will prevail, as the only platform that MxN apparently benefits is the IA-32 platform, because of its 8192 threads per process limitation. I gather Itanium and Opteron do not have this problem.
However, both threading models could continue to be included, as Sun offered two threading models in Solaris 8 for SPARC.
We should also realize that all of these scalability improvements, while opening Linux up for bigger server hardware, will probably adversely affect performance on a single CPU workstation.
Maybe we need two distinct, but compatible, threading models: one focused on the workstation environment, where one is not typically running a bizillion threads, and one for the server environment, where the Javas and the Oracles of the world can produce threads to their hearts' content.
Re:Haven't we gone through this before? (Score:2)
Re:Call me stupid but (Score:3, Informative)
Basically, while Linus was incommunicado sailing across the ocean, someone got jumpy and suggested 3.0 should be the next step.
It might be more likely that it proceeds through 2.10 and higher before going to 3, though. Just to confuse the people who think version numbers are floating-point.
Re:Call me stupid but (Score:4, Interesting)
I don't think the 2.6 vs. 3.0 debate is over yet. But it seems to be quiet right now. I think the discussion will flare up again when the release date starts getting close, about a year from now. And I even think there will be discussion after the release, because the version number will come as a complete surprise to some people. And I will not try to guess how much doubt will be in Linus' mind once he actually wants to release the thing.
But if you want to be unambiguous when talking about it, you should call it 2.6. If it turns out not to be the case, everybody will know that it was indeed 3.0 you were talking about. But if you talk about 3.0 already now, and it turns out to be called 2.6, then 3.0 might be something else released in the future.
Re:Call me stupid but (Score:2)
First of all, make that if it breaks binary compatibility with 2.4, because if binary compatibility is not preserved between 2.4 and the latest 2.5 kernel, there will be a lot of incompatibilities within the 2.5 series as well. And what happened during the development is not what we want to talk about anyway.
But what do you have in mind, when talking about binary compatibility? AFAIK Linus does consider binary compatibility on the kernel/user mode interface to be important. Deprecated interfaces might be removed at some time, but before that it might even have produced warnings for some time if any programs used it.
When I upgraded from 2.2 to 2.4 binary compatibility was preserved (mostly), I did not replace any libraries or executables because of the kernel change. But I did find a few problems: rpc.rstatd would core dump on the first request, because of a change of the format of a
I don't know how many interfaces changed between 1.2 and 2.0; I never used 1.2, my first kernel was 2.2. But I know that the reasons for Linus to choose 2.0 as version number were primarily the addition of SMP and portability to different platforms. Yet it might have remained binary compatible with 1.2. I suspect the executable format used by 1.2 was a.out while today all executables are ELF. But kernels still have optional support for a.out binaries. But actually I think even ELF support is optional, so I could build two 2.4 kernels with one supporting a.out and the other supporting ELF; they will have the same version number but obviously be binary incompatible.
Don't expect any Linux kernel in the future to introduce major binary incompatibilities with your previous kernel. There will be changes, but they will be slow, so most executables will be usable across three or more kernel releases. It shouldn't be hard to find an executable working on 2.0, 2.2, 2.4, and the latest 2.5.
Re:And so it begins ! (Score:1)
Re:Site down or not found (Score:1)
Re:Site down or not found (Score:2)
Come on people..mod him back down.
LOL.
Re:GNU/HURD (Score:3, Funny)
More realistically, like communism, GNU/HURD is a great idea in theory, but the only available example right now is run by fascist lunatics and doesn't work...
AC because I'm too cruel