Jens Axboe On Kernel Development
BlockHead writes "Kerneltrap.org is running an interview with Jens Axboe, 15-year Linux veteran and the maintainer of the Linux kernel block layer, 'the piece of software that sits between the block device drivers (managing your hard drives, cdroms, etc) and the file systems.' The interview examines what's involved in maintaining this complex portion of the Linux kernel, and offers an accessible explanation of how IO schedulers work. Jens details his own CFQ, or Complete Fair Queue scheduler, which is the default Linux IO scheduler. Finally, the article examines the current state of Linux kernel development, how it's changed over the years, and what's in store for the future."
Khmm... Block devices? How quaint! (Score:3, Informative)
FreeBSD dispensed with them altogether years ago...
Character devices only, thank you very much.
*Duck*
No block devices = no disk scheduling? (Score:5, Interesting)
At risk of starting a holy war, is there any reason why one approach would be superior? And do they lend themselves to different methods of scheduling? In TFA, Axboe talks about [1] the scheduling mechanism used in later versions of the 2.6 kernel series, which alleviates a problem that I (and most other people, probably) have run into before.
I'm curious because, although I don't use any of the 'real' BSDs very often, I spend most of my time (at home, anyway) using either Mac OS X, which uses the Mach/XNU kernel (derived from 4.3BSD, although I don't know whether the I/O scheduler has been rewritten since then), or Linux with the 2.6 kernel, and it seems to me that OS X's disk I/O leaves something to be desired compared to Linux's.
Does BSD handle I/O in some fundamentally different fashion from Linux? It sounds like, by eliminating block devices, they basically stop the kernel from doing any re-ordering or caching of data, which makes things "safer" (in the event of a crash) but seems like it would have big performance penalties when using drives that aren't very smart, and don't do a lot of caching and optimization on their own. It seems like getting rid of I/O scheduling altogether is a stiff price to pay for "safety."
[1] (quoting because there don't seem to be anchors in TFA)
Re: (Score:1)
Good question.
The FreeBSD people claim that no one is using block devices anyway (source [freebsd.org]):
Err, no (Score:3, Informative)
No; FreeBSD's shifted the buffer cache away from individual devices and into the filesystem/VM, where it caches vnodes rather than raw data blocks. The IO queue (below all this block/character/GEOM stuff) is scheduled using a standard elevator algorithm [wikipedia.org] called C-LOOK. It's showing its age in places, and there's been some effort towards replacing/im
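For readers who haven't met C-LOOK before, here is a minimal sketch of the textbook algorithm, not FreeBSD's actual code. The function name `c_look_order` and the sample block numbers are made up for illustration: the head sweeps upward through every pending request at or beyond its current position, then jumps back to the lowest outstanding request and sweeps upward again.

```python
def c_look_order(pending, head):
    """Return pending block numbers in textbook C-LOOK service order.

    pending: iterable of block numbers waiting to be serviced
    head:    current head position (a block number)

    Illustrative only; the name and signature are hypothetical.
    """
    # Requests at or beyond the head are served first, in ascending order.
    ahead = sorted(b for b in pending if b >= head)
    # Requests behind the head wait for the next sweep, which restarts
    # from the lowest outstanding block (the "circular" part of C-LOOK).
    behind = sorted(b for b in pending if b < head)
    return ahead + behind


if __name__ == "__main__":
    queue = [98, 183, 37, 122, 14, 124, 65, 67]
    print(c_look_order(queue, head=53))
```

Because the head only ever services requests while moving in one direction, requests near the middle of the disk get no advantage over those at the edges, which is the usual argument for C-LOOK over a plain back-and-forth elevator.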
Re: (Score:3, Interesting)
sounds hard (Score:2)
That sounds REALLY hard. I'd be more interested if there's a development strategy he could recommend re: complex development projects.
Scared me... (Score:2, Funny)
Disagree with Mr. Axboe... (Score:5, Interesting)
If core changes of such magnitude are no longer sufficient to merit a dev branch or even a major point release, why bother with the "2.6" designation at all? Just pull a Solaris and call the next release "Linux 20" or "Linux XX."
-Isaac
Re: (Score:1)
Many such as myself are getting tired
Re:Disagree with Mr. Axboe... (Score:5, Insightful)
If you're the kind of kernel hacker who liked to get yours directly from kernel.org, then yes, it sucks. But IMO the kernel has grown too big for just the core devs; think of it as an "extended" kernel team including the distros, where kernel.org releases are "internal betas". I think if you cut it back and expect just kernel.org to deliver stable kernels with the resources they have (which, admittedly, they used to) then kernel development will slow way down.
Re: (Score:2)
I live with the fragmentation and vendor lock-in that comes with distro-engineered kernels because I have to, but I don't like it. I'm just saying that
Re: (Score:2)
Re: (Score:2)
Re:Disagree with Mr. Axboe... (Score:5, Insightful)
Don't take this the wrong way, but your complaint sounds a lot like the story about a patient and a doctor:
"Doctor, when I do this, it hurts", and the doctor replies, "Well don't do that".
I mean, if you are following bleeding-edge kernels and complaining that they aren't as stable as you'd like, why not just follow a vendor's kernel? If you use or install "many thousands", you are either maintaining your own de facto distribution or you are using someone else's distribution. Vendors do exactly the work you want done on your behalf.
I patiently wait for my vendor kernel, which might be 10 point releases behind, to integrate bug fixes, and then upgrade in a year or two to a much newer point release (I think RedHat has used 2.6.9 and/or 2.9.13 in recent memory)... Incrementing a different number wouldn't really make any difference anyways. At that point it's all semantics; if you know the rules of the game, it's not hard to tell what's dangerous as an upgrade and what's not.
It's not like 2.4.13 (or whatever one in the 2.4 series that introduced serious disk corruption) was safe merely because it was a point release... They are safe because somebody took it out back and beat on the kernel for a while and it didn't cause any problems. If you upgrade without proper testing and it breaks, you get to keep the pieces.
Kirby
Re: (Score:2)
Last time I checked, there are approximately 5-10 major distros. The vendors have to do their due diligence on any kernel. It's not like RedHat picks a kernel and sits on it for several years. Cherry picking fixes isn't tons of fun, but it's not impossible and if the bugs don't affect anyone why exactly do they need to be fixed again? It's not like those people don't communicate or you can't work with them if you choose to. SuSE and RedHat guys talk. SuSE and Gentoo talk. Having everyone attempt to
Re: (Score:3, Insightful)
In other words, the previous development model made, say, 1% of people happy (you) and 99% unhappy (distros and hence people using distros). The current model makes 99% of people happy (distros) and 1% unhappy.
IMO it was a good
Re: (Score:2)
Where are they now? (Score:4, Interesting)
Exhilarating! (Score:1)
There's something exciting about delving into the low-level logic; it gives you the feeling that there's always something more to learn!
I guess always being two steps behind is the motivation that makes it all worthwhile.
Wow ... (Score:3, Funny)
In the interview he says he is now 30 years old. Wow that means he started working in Linux at the age of 15 - a real prodigy. A very interesting interview.
Btw, it is nice that kerneltrap.org has finally had a makeover. The earlier website design looked rather drab.
Re:Wow ... (Score:4, Informative)
What about the process' priority? (Score:5, Insightful)
I wonder whether the originating process's priority is taken into account at all... It has always annoyed me that the "nice" (and especially the idle-only) processes are still treated equally when it comes to I/O...
Re: (Score:1)
Re: (Score:3, Insightful)
Indeed, it does — but should not the I/O-niceness be automatically derived from the process' niceness?
Re: (Score:2)
Re: (Score:2)
nice(1) should be doing that (with the help of the kernel-provided mechanisms) then, in my not so humble opinion. Some kind of ionice can be used for finer tuning, but by default a nicer process should be nicer on everything — IO included.
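To make "derive I/O niceness from CPU niceness" concrete, here is a small sketch. The formula is the one the ionice man page gives for a process that has never requested an I/O priority of its own (best-effort class, level computed from the CPU nice value); check the man page for your own kernel before relying on it. The function name `io_priority_from_nice` is made up for illustration and is not a kernel or library API.

```python
def io_priority_from_nice(nice):
    """Map a CPU nice value (-20..19) to a best-effort I/O level (0..7).

    Level 0 is the highest I/O priority and 7 the lowest, matching the
    range ionice documents for the best-effort class.
    Hypothetical helper for illustration only.
    """
    if not -20 <= nice <= 19:
        raise ValueError("nice must be between -20 and 19")
    # Integer division: every 5 steps of niceness costs one I/O level.
    return (nice + 20) // 5


if __name__ == "__main__":
    for nice in (-20, 0, 10, 19):
        print(nice, "->", io_priority_from_nice(nice))
```

Under this mapping a default process (nice 0) lands in the middle of the range, and a fully niced one (nice 19) gets the lowest best-effort level but is still not in the idle class, so it keeps receiving some I/O time.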
Re: (Score:2)
I suspect nice(1) was not changed for backwards compatibility reasons. There would perhaps be corner cases where a process expected its fair share of I/O time but didn't need much CPU (e.g., tar zcf scripts for backups?) that would suffer too much or not complete if it were suddenly I/O starved.
Re: (Score:2)
Yes, CPU priority is taken into account (Score:1)
Are you sure they are? See the ionice man page [die.net] here:
CFQ not the default scheduler? (Score:5, Informative)
The anticipatory I/O scheduler is the default disk scheduler. It is
generally a good choice for most environments, but is quite large and
complex when compared to the deadline I/O scheduler, it can also be
slower in some cases especially some database loads.*
Anticipatory is also preselected with a fresh
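Rather than going by the help text, you can ask a running 2.6 kernel which scheduler a disk is actually using: `/sys/block/<device>/queue/scheduler` lists the compiled-in schedulers on one line, with the active one in square brackets. Below is a minimal sketch of reading it. The helper names and the `sda` default are assumptions for illustration, and the sample line in the demo is made up, not captured from a real machine.

```python
def parse_scheduler_line(line):
    """Split a sysfs scheduler line into (active, available).

    The active scheduler is the name wrapped in square brackets.
    Hypothetical helper for illustration only.
    """
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # strip the brackets marking the active one
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="sda"):
    """Read the scheduler line for one block device (device name assumed)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler_line(f.read())


if __name__ == "__main__":
    # Made-up sample line, so the demo runs on machines without /sys/block/sda.
    print(parse_scheduler_line("noop anticipatory deadline [cfq]\n"))
```

On a real system you would call `read_scheduler("sda")` (or whatever your disk is called) instead of parsing the sample string.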
Re: (Score:3, Informative)
Re:CFQ not the default scheduler? (Score:5, Informative)
Re: (Score:2)
Scheduling better than no scheduling? (Score:5, Interesting)
Reading TFA piqued my interest in I/O scheduling and I've been doing some reading on it, and it seems like there are several competing schools of thought, of which Axboe's (and potentially that of the Linux kernel developers generally) is only one.
An alternative view, such as this from Justin Walker (a Darwin developer) on the darwin-kernel mailing list [apple.com], holds that it's not worthwhile for the OS kernel to do much disk scheduling, since "the OS does not have a good idea of the actual disk geometry and other performance characteristics, and so we [kernel developers] leave that level of scheduling up to the controllers in the disk drive itself. I think, for example, that recent IBM drives have some variant of OS/2 running in the controller. Since the OS knows nothing about heads, tracks, cylinders for modern commodity disks, it's futile to try to schedule I/O for them." (written Mar 2003)
Axboe seems to acknowledge that this may sometimes be the case, because they do have the 'non-scheduling scheduler,' which he recommends only for use with very intelligent hardware. However, it seems like some people think that commodity drives are already 'smart enough' to do their own scheduling.
It seems like determining which approach was superior would be relatively straightforward, and yet I've never seen it done (although maybe I'm just not looking in the right places). Anecdotally, I'm tempted to agree with Axboe, since it seems like, when doing things where several processes are all thrashing the disk simultaneously, my Linux machine feels faster than my OS X one, but this is by no means scientific (they don't have the same drives in them, not working with the same datasets, etc.).
On what drives, and under what conditions, is it advantageous to have the OS kernel perform scheduling, and on which ones is it best just to pass stuff to the drive and let the controller do all the thinking?
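Short of a real benchmark, a toy model at least shows what the argument is about. The sketch below assumes that seek cost is proportional to the distance between block numbers, which is exactly the assumption the Darwin position disputes for modern drives, so treat it as an illustration of the pro-scheduling case rather than evidence. All names and block numbers are made up.

```python
def head_travel(order, start):
    """Total distance the head moves when servicing blocks in this order."""
    travel, pos = 0, start
    for block in order:
        travel += abs(block - pos)
        pos = block
    return travel


def interleave(a, b):
    """Alternate requests from two processes, mimicking arrival order."""
    merged = []
    for x, y in zip(a, b):
        merged.extend((x, y))
    return merged


if __name__ == "__main__":
    proc_a = [100, 101, 102, 103]      # sequential reader near the start
    proc_b = [9000, 9001, 9002, 9003]  # sequential reader far away
    arrivals = interleave(proc_a, proc_b)
    # Pass-through ("no scheduling"): service requests as they arrive.
    print("FIFO travel:  ", head_travel(arrivals, start=0))
    # Simplest possible reordering: one ascending sweep.
    print("sorted travel:", head_travel(sorted(arrivals), start=0))
```

If the drive's own firmware reorders requests just as well, or if block numbers don't map to physical position the way this model pretends, the gap between the two numbers shrinks or disappears, which is why only measurements on real hardware can settle the question.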
Re: (Score:1)
Re: (Score:3, Informative)
IO scheduling is a lot more than that, however. If you have several active processes issuing IO, the IO scheduler can make a large difference to throughput. I actually just did a talk at LCA 2007 with some results on this, you can download th
Re: (Score:1)
Hehe. (Score:2)
Re: (Score:2)
Re: (Score:2)
High disk usage (Score:2)
Sort of :) (Score:2)
[His] is the part of the kernel that's responsible for making systems slightly less slow during extended disk writes, while the CPU utilization is minimal.
And even that's not quite true; where the scheduler really comes into play is when you have two or more processes trying to access the disk at the same time. During an extended, sustained read or write, the scheduler probably just needs to stay the hell out of the way and pass data as fast as it can.
You could al
Missing Question: How do you pronounce your name? (Score:3, Interesting)
Re:Missing Question: How do you pronounce your nam (Score:4, Interesting)
Re: (Score:2)
Close enough.
Re: (Score:3, Informative)
That is correct, like a "y", rhymes with "mens". I saw another question on the last name; I typically tell foreigners that it is pronounced ax-bow. Europeans often think the 'oe' is like the Danish "ø", however that is not the case.
Re: (Score:2)
Re:Missing Question: How do you pronounce your nam (Score:1)
But since Jens is a Dane like myself, I'll give it a shot.
Yens Aksbo
Where Yens is pronounced with the stress on the e.
Yêns
And boe is pronounced without the e, and with the stress on the a.
âksbo
Hope this helps.
Re:Missing Question: How do you pronounce your nam (Score:2, Informative)
Jens is NOT pronounced "Djens". "J" is pronounced as a Palatal approximant [wikipedia.org] in Danish - just like "y" in English. Yens is somewhat more correct, but the "e" has to be pronounced like the IPA [æ]. Danish is not logical at all. If it were, "Jens" would be spelled with an "æ". Take a look at Jens [wikipedia.org].
IPA: [jæns]
Axboe is more complicated:
Re: (Score:2)
Re: (Score:1)
This is what Slashdot is about (Score:4, Interesting)
BTW, does anyone have a good set of benchmarks of the performance of different IO schedulers when running one or two or three IO intensive tasks, when running one intensive and many small tasks, etc.? That would actually help me decide whether to rebuild my kernel with CFQ.
Also, ionice would have made my old machine much more usable when doing backups... Oh well.
Re: (Score:2)
Is it any different with your new machine? My Athlon X2 (SATA disks, 2GB etc) crawls when I start rsyncing my
Reiser4 (Score:1)
How do I know? Why, it's on the Namesys webpage!
Re: (Score:2, Interesting)