Upgrades Linux

5 Years of Linux Kernel Releases Benchmarked

Posted by samzenpus
from the line-them-up dept.
An anonymous reader writes "Phoronix has published benchmarks of the past five years' worth of Linux kernel releases, from the Linux 2.6.12 through Linux 2.6.37 (dev) releases. The results from these benchmarks of 26 versions show that, for the most part, new features haven't affected performance."
This discussion has been archived. No new comments can be posted.

  • Windows Kernels (Score:5, Interesting)

    by Anonymous Coward on Thursday November 04, 2010 @09:32AM (#34123456)

    What about running the same study on the Windows kernel from XP to 7?

    • Re: (Score:3, Insightful)

      by coolsnowmen (695297)

      While interesting, it isn't exactly the same; in linux, you can actually just change the kernel, without changing all the services and starting software.

      • Even changing kernels can be problematic if you go back far enough; you start running into problems (as mentioned in TFA) with not being able to build your kernel with the same version of gcc. If it were not for this factor, I would be more interested to see a comparison against the 2.0-2.4 kernels. Having said that, since 2.4.37.10 was only released last September, I would imagine that that should be compatible with current compilers.
  • by edelholz (1098395) on Thursday November 04, 2010 @09:47AM (#34123604)

    They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?

    • by mtippett (110279) on Thursday November 04, 2010 @10:05AM (#34123804) Homepage

      Considering the efforts going into VM these days and the massive deployments in Fortune 500 companies, the performance of VM based systems is predictable. All the testing with Phoronix Test Suite is repeated until there is less than 3% variance between the results - or the result set is discarded.
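A minimal sketch of that repeat-until-stable rule, for anyone who wants to try it on their own timings. It assumes "3% variance" means the relative standard deviation (standard deviation divided by the mean) of the collected runs; that reading, the function name `run_until_stable`, and the `min_runs`/`max_runs` limits are illustrative assumptions, not how Phoronix Test Suite is actually implemented.

```python
import statistics
from typing import Callable, List, Optional


def run_until_stable(
    run_once: Callable[[], float],
    min_runs: int = 3,            # assumed minimum number of replicates
    max_runs: int = 10,           # assumed cap before giving up
    max_rel_stdev: float = 0.03,  # the 3% threshold, read as stdev / mean
) -> Optional[List[float]]:
    """Repeat a benchmark until the spread of its results is small enough.

    Returns the collected samples once stdev/mean drops below the
    threshold, or None (result set discarded) if it never settles.
    """
    samples: List[float] = []
    for _ in range(max_runs):
        samples.append(run_once())
        if len(samples) < min_runs:
            continue  # not enough replicates to judge the spread yet
        mean = statistics.mean(samples)
        if mean == 0:
            continue  # relative spread is undefined for a zero mean
        rel_stdev = statistics.stdev(samples) / abs(mean)
        if rel_stdev < max_rel_stdev:
            return samples
    return None
```

Here `run_once` would be whatever callable executes one benchmark pass and returns its score or elapsed time.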

      Realistically, looking at older kernels on modern hardware is actually a very critical dimension for corporate server environments. There are applications in that space that are deployed and supported only on some old distribution. Being able to understand how Red Hat 7.1 will behave vs Red Hat 5 is critical for some environments.

      • by chrb (1083577)

        the performance of VM based systems is predictable

        I agree that benchmarking a single VM on a VM host is a valid thing to do, and will give fairly reproducible results. But it can get more difficult with more complex setups. You need to be able to manage the complexity and eliminate or randomise all the factors. Benchmarking a single VM running on a VM host with 20+ other active VMs, with snapshots being created and merged, and with variable network and disk configurations, gets more difficult.

        All the testing with Phoronix Test Suite is repeated until there is less than 3% variance between the results - or the result set is discarded.

        What is the minimum number of replicates for each setup? 3% vari

        • One thing I'm curious about is the kernel configuration these guys used - I couldn't find it. Unless they built the kitchen sink into the kernel in the first place, I find it difficult to see how they could have used the same .config for that many builds.

          Until a year or two ago, I used to be an inveterate kernel stripper; any driver or service that wasn't used or supported by my hardware got ruthlessly taken out. This did leave me with more responsive machines at the minor cost of my time. More recently I
      • by RichiH (749257)

        How do you know that running in a VM doesn't affect one kernel version more than another?

        Being too lazy/stupid to start a machine on bare metal? Come the fuck on.

        Of course, Phoronix being the vile pretend-useful bottom-feeding site that it is, they would never care about making sure there are no outside factors over generating page impressions quickly and cheaply.

        • by mtippett (110279)

          How do you know that running on an AMD doesn't affect one kernel version more than another vs Intel. The same argument stands. It's a machine layer for running code.

          Sure, it's not what you want, but don't consider it completely invalid. There are many people who have interest in virtualized performance.

          • by RichiH (749257)

            > How do you know that running on an AMD doesn't affect one kernel version more than another vs Intel.

            It does, at least if you compile for it.

            > There are many people who have interest in virtualized performance.

            I am amongst them. We run a few hundred VMs.

            > Sure, it's not what you want, but don't consider it completely invalid.

            Not completely invalid. Yet, a very basic mistake in benchmarking was made due to inability and/or laziness which could have a major impact on the validity.
            We are used to this

    • Re: (Score:3, Interesting)

      by chrb (1083577)

      They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?

      If they test in a VM, on only one particular hardware configuration, then the results only apply to that specific test setup. If the fact that the experiments are run inside a VM introduces variability into the results, then this will show up as a large variance. [wikipedia.org] However, having a larger variance does not in itself negate the results - but remember that the results can't be generalised to other configurations - they only apply to this particular setup.
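To make the variance point concrete, here is a rough sketch of the summary statistics one might compute from the raw runs of a single setup. The function name `summarize` and the choice of fields are assumptions for illustration; the article does not say which statistics were computed.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Mean, sample variance, standard deviation and standard error."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(samples)
    variance = statistics.variance(samples)  # sample variance (divides by n - 1)
    stdev = math.sqrt(variance)
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "stdev": stdev,
        "stderr": stdev / math.sqrt(n),  # standard error of the mean
    }
```

If the VM adds noise, it shows up as a larger `variance` and `stderr` for the same number of runs. As the comment says, that widens the error bars but does not by itself invalidate the result.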

      In order to produce experimental results that can be ge

      • Re: (Score:3, Interesting)

        by mtippett (110279)

        The "get to statistical variance" has been in Phoronix Test Suite for the better part of a year.

        As part of the new work happening with Phoronix Test Suite, and the online aggregation site OpenBenchmarking.org, we'll be looking to expose the raw data and allow people to view a particular set of results in a possibly more meaningful way. What is being examined now is raw data (scatter diagram), box plot (percentiles), violin plots (kernel function based), full standard error reporting (error bars, numerical

    • Re: (Score:2, Insightful)

      by Anonymous Coward

      They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?

      Does it matter?
      They are after deltas, not absolutes.

      *IF* they test each kernel in the same VM on the same metal then any change is valid. The numbers are abstract; the difference between releases is what is key.
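A small sketch of the "deltas, not absolutes" argument. The numbers are made up purely for illustration, and the hypothetical VM is assumed to slow every run by the same factor. That uniform-overhead assumption is exactly what other commenters in this thread dispute, so treat this as the best case.

```python
from typing import Dict


def relative_deltas(results: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Percent change of each kernel's result relative to the baseline kernel."""
    base = results[baseline]
    if base == 0:
        raise ValueError("baseline result must be non-zero")
    return {
        kernel: (value - base) / base * 100.0
        for kernel, value in results.items()
    }


# Made-up scores, not taken from the article.
bare_metal = {"2.6.12": 200.0, "2.6.24": 210.0, "2.6.37": 150.0}

# Hypothetical VM that costs a uniform 20% on every single run.
in_vm = {kernel: value * 0.8 for kernel, value in bare_metal.items()}

deltas_bare = relative_deltas(bare_metal, "2.6.12")
deltas_vm = relative_deltas(in_vm, "2.6.12")
```

Under that assumption the two sets of deltas agree (up to floating-point rounding) even though the absolute scores differ. If the VM penalizes some kernel versions more than others, the deltas no longer carry over.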

    • by Anonymous Coward

      They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?

      The real problem with running this kind of comparative benchmark in a VM isn't even predictability. It's that virtualization affects kernel performance in many profound ways. Many performance metrics you might choose to test will depend on the host kernel and virtualization environment and how it interacts with the guest kernel. In other words, you're not testing the performance of the guest kernel in isolation.

      For example, say you use a combination of host and guest which supports native IO (where the g

      • Re: (Score:3, Interesting)

        by arth1 (260657)

        In addition, a VM will use available assigned cores on the host, without locking them 1:1. This changes the behavior quite a bit, especially when it comes to CPU cache. The guest thinks it is running on the same core, but in reality it jumps between them, and has to reload from higher level cache or even memory.

        Worse, from a benchmarking standpoint, hyperthreading will be exposed to the guest as separate CPUs. An intelligent scheduler would want to run distinct tasks on different cores, but can't do so i

  • Obviously... they forgot the bloat feature.
  • by QuantumBeep (748940) on Thursday November 04, 2010 @09:53AM (#34123696)

    It seems almost every benchmark that had any difference was slower in more modern kernels. It's not all sunshine and roses.

    • Actually, looking at the same graphs you did, I concluded the opposite.
      • by olau (314197)

        Yeah, note that some of the benchmarks are measuring bytes/sec so higher is better. :)

    • by timeOday (582209) on Thursday November 04, 2010 @12:43PM (#34125834)
      I would agree it's not all sunshine and roses, but let's at least look a little more closely. There are some disturbing regressions in there, although keep in mind other improvements (such as moving to a journalling filesystem) may come at a cost to performance, which may be justified.

      Better

      • Apache Compilation: 40% less time
      • Disk Transactions: 50% less time

      Worse

      • GnuPG File Encryption: 60% more time
      • Time to transfer 10GB via the TCP network loop-back: 100% more time
      • Apache static web page serving: 50% more time
      • IOZone Writes: 20% more time

      Same

      • CAMELLIA256-ECB cipher
      • OpenSSL
      • NASA's NPB
      • TTSIOD 3D renderer
      • C-Ray multi-threaded ray-tracing
      • Crafty, an open-source chess engine
      • MAFFT multiple-sequence alignment test (a molecular biology workload)
      • Himeno Poisson Pressure Solver
      • Blowfish performance with John The Ripper
      • LAME MP3 encoding
      • 7-Zip compression
      • Dhrystone 2
      • FS-Mark
      • IOZone Reads
      • Threaded IO tester
      • Parallel BZip2 compression
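Since the thread mixes "time taken" figures (lower is better) with throughput figures (higher is better), it may help to put them on one scale. The sketch below converts a "X% more/less time" figure into a speed factor; the helper name `speed_factor` and the sign convention are assumptions made for this example.

```python
def speed_factor(time_change_percent: float) -> float:
    """Convert a change in elapsed time into a relative speed.

    time_change_percent: +100 means "100% more time", -50 means
    "50% less time". The return value is new speed / old speed,
    so anything above 1.0 is faster and anything below 1.0 is slower.
    """
    new_time = 1.0 + time_change_percent / 100.0  # old time normalized to 1.0
    if new_time <= 0:
        raise ValueError("a run cannot take zero or negative time")
    return 1.0 / new_time
```

Read this way, "50% less time" for disk transactions is a 2x speed-up, while "100% more time" on the loop-back transfer means half the original throughput. The percentages are not symmetric, which is easy to miss when eyeballing the lists.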
    • Re: (Score:3, Interesting)

      by CAIMLAS (41445)

      Not only that, but they only looked at the kernel with a specific version of GCC. Due to this, the performance differences could theoretically be accounted for solely by minute differences in how the compiler handles things.

      The bigger thing with Linux performance isn't just the kernel - it's the entire stack. You've got the kernel, sure - and then you've got the core libraries (glibc, etc.) and the compiler which built them. These all can change performance significantly, and in real-world environments, t

  • It seems that Phoronix needs a faster kernel on their server...

    Seriously though, some of the performance drops (and how they have been sustained in later kernel versions) make me wonder if there is adequate load testing as part of the kernel QA process.

    • Re: (Score:2, Insightful)

      by gmack (197796)

      Keep in mind that the biggest drop was most likely due to ext4 adding data journaling, rather than the usual metadata journaling, to make file contents less likely to be corrupted after an unplanned shutdown (power outage, etc.).

      I didn't see any mention of them turning that feature off to find out one way or another.

      • Re: (Score:3, Insightful)

        by ustolemyname (1301665)

        Some of the changes noted in the Linux 2.6.30 kernel change-log that was used throughout the Linux testing process included...

        Yeah, that new EXT4 filesystem that they didn't use for obvious reasons. Huge impact on the results.

        • Sorry, slashcode seems to be blocking my ability to copy paste today (opensuse 11.3, chrome 7 beta, asus 701...)

          change-log for the EXT3 file-system that was used throughout

          Quote is available on the third page of the article, first paragraph.

  • Overkill (Score:4, Funny)

    by TrailerTrash (91309) on Thursday November 04, 2010 @10:35AM (#34124138)

    What more Linux benchmarking do you need besides bogomips? Jeez.

  • by m4c north (816240) on Thursday November 04, 2010 @11:15AM (#34124594)

    Where are the kernel-level tests that do more than exercise the filesystem and network driver (singular) and the scheduler? More than half of those charts were flat, which could mean they weren't making appropriate measurements.

    For example, show how mutexes have improved, or copy-on-write, or interrupt handlers, or timers, or workqueues, or kmalloc, or anything else that a system and kernel programmer would care about. I like the user-centric perspective: it's very good information to have and share, but don't call what you've done a kernel benchmark. Maybe call it a kernel survey of its impact on users.

    • by Timmmm (636430)

      The only thing they changed was the kernel. Performance differences can only be due to changes in the kernel. In what way is that not a kernel benchmark?

      • by jdgeorge (18767)

        It's not a COMPLETE kernel benchmark in that it only exercises certain parts of the kernel.

        And since you obviously needed a car analogy: It's still like ONLY testing how fast a car goes 0 to 60 miles per hour, but not the towing capacity, fuel efficiency, braking distance, or crash performance, and a bunch of other things.

      • by c (8461)

        > Performance differences can only be due to changes in the kernel.

        ... or to the VM having better support for certain features used in that particular kernel version, or that particular VM being configured in such a way that some kernels run better than others, or the host kernel somehow having better support for some features of the VM and benchmarked kernel, or...

        Which is perfectly fine as long as it's made very clear that the benchmarks are subject to all of those conditions. Personally, I think t

      • Re: (Score:3, Insightful)

        by CAIMLAS (41445)

        IF you were running the tests on real hardware, I'd be more likely to agree.

        They weren't. They were running it on a virtualized host in KVM. This means that not only were their results largely determined by the specific network, etc. drivers they used (which can see significant revision between kernels and not accurately reflect the kernel itself), but any idiosyncratic behavior in KVM in how it treats guest interfaces may account for the discrepancies.

  • ugh (Score:5, Informative)

    by buddyglass (925859) on Thursday November 04, 2010 @12:15PM (#34125408)

    I love that Phoronix is willing to take the time to run tests like this. I just wish they'd learn how to run meaningful tests. For instance, why are they testing a bunch of CPU-bound things? Kernel won't affect that unless we're talking about SMP performance. If you want to test the kernel, test how well it handles SMP, network I/O and disk I/O. And bear in mind that disk I/O will be hugely affected by which filesystem is used and its configurable settings.

    Another problem with their article is that it tests individual kernels. Most folks don't use a vanilla kernel. They use one provided by their distro, which may have distro-specific patches that address some of the performance problems (or add new ones). What I would have preferred to see is a comparison of different distro releases over the last 5 years, focusing on the most popular ones (say Ubuntu, Fedora and SuSE).

    The meaningful tests (and their results) were:

    1. GnuPG: avoid 2.6.30 and later.

    2. Loopback TCP: avoid 2.6.30 and later.

    3. Apache Compilation: avoid 2.6.29 and earlier.

    4. Apache static content: avoid 2.6.12, 2.6.25, 2.6.26, then 2.6.30 and later.

    5. PostMark: avoid 2.6.29 and earlier.

    6. FS-Mark: avoid 2.6.17 and earlier, 2.6.29, then 2.6.33 to 2.6.36.

    7. ioZone: unless you're willing to run 2.6.21 or earlier, avoid 2.6.29 and you're fine.

    8. Threaded I/O: avoid 2.6.20 and earlier, 2.6.29, then 2.6.33 to 2.6.36.

    Based on these results, #1 and #2 seem to be testing the same thing, and tests #3 and #5 seem to be testing the inverse of whatever that thing is. 2.6.29 seems to be especially crappy, performing worse than the kernels immediately before and immediately after it on tests #6, #7 and #8. In terms of recent kernels, tests #6 and #8 suggest a regression in 2.6.33 that has been resolved in 2.6.37.

    If it were me, I'd look at either running 2.6.37 (when it's released) or falling back to 2.6.32 if my hardware was supported.
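The eight "avoid" rules above can be tallied per kernel, which makes the overlaps easier to see. This is only a sketch under stated assumptions: kernels are identified by their 2.6.x minor number (12 through 37), every test is weighted equally, and rule #7 is read as flagging 2.6.29 only. The names `AVOID` and `flag_counts` are hypothetical.

```python
from typing import Callable, Dict

KERNELS = range(12, 38)  # minor numbers for 2.6.12 .. 2.6.37

# One predicate per test: True means "avoid this kernel" per the list above.
AVOID: Dict[str, Callable[[int], bool]] = {
    "GnuPG": lambda k: k >= 30,
    "Loopback TCP": lambda k: k >= 30,
    "Apache compilation": lambda k: k <= 29,
    "Apache static content": lambda k: k in (12, 25, 26) or k >= 30,
    "PostMark": lambda k: k <= 29,
    "FS-Mark": lambda k: k <= 17 or k == 29 or 33 <= k <= 36,
    "ioZone": lambda k: k == 29,  # assumed reading of rule #7
    "Threaded I/O": lambda k: k <= 20 or k == 29 or 33 <= k <= 36,
}


def flag_counts() -> Dict[int, int]:
    """Number of tests that flag each kernel as one to avoid."""
    return {
        k: sum(1 for avoid in AVOID.values() if avoid(k))
        for k in KERNELS
    }


counts = flag_counts()
```

With this encoding, #1 and #2 flag exactly the same kernels and #3 and #5 flag exactly the complement, which matches the "same thing / inverse" observation. Among the 2.6.30+ kernels, 2.6.32 and 2.6.37 are each flagged by three tests, versus five for 2.6.33 through 2.6.36.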

    • Re:ugh (Score:4, Insightful)

      by mtippett (110279) on Thursday November 04, 2010 @01:55PM (#34126934) Homepage

      This made me laugh - in a good way, not at you :).

      When Phoronix does a distro-comparison the crowd calls out that the tests are only really testing gcc differences, and should have fewer variables changing. When Phoronix does a fixed comparison varying only one part of the system, the crowd calls out that it isn't a good basis since people don't run it that way.

      Phoronix runs tests in different ways to explore the performance landscape. For some it gives precisely the information that they need; for others it's completely irrelevant. In this particular case, I'm glad that the data gave you enough to have some open questions about 2.6.32 vs 2.6.37. If people walk away with those sorts of first-order interpretations, the article served its purpose.

      Of course the next step would be how do we take a tighter look at the delta between 2.6.32 and 2.6.37 - any thoughts?

      Regarding meaningful vs meaningless tests. The tests Phoronix runs are a collection of tests to explore. The tests were run, and for some of them, the results yielded nothing interesting but were still reported. You don't know until you run the tests, and if the tests are run, you report on them. Some tests may be stable now, but may have sensitivity to other parts of the systems. Even CPU bound tests will yield different results in different cases (scheduler, etc).

      • Re: (Score:3, Insightful)

        by TheLink (130905)
        I suspect the scheduler would make a bigger difference if you were running multiple processes at the same time.

        e.g. multiple processes in various scenarios:
        CPU intensive.
        disk IO intensive.
        network IO intensive, single NIC.
        network IO intensive, two NICs.
        network IO intensive, four NICs.
        And various combinations of CPU, disk, network.

        Then latency tests:
        One to X processes with high CPU, while measuring latency experienced by another process.
        One to X processes with high IO, while measuring latency experienced by a
  • What's next, we all believe Eugenia from OSNews when she spews about BeOS? These guys are just page-view leeches, ignore them and they'll wither and die.

  • This comes as no surprise. In any activity which is mostly limited by CPU in user mode, not much changes, you can track that over a number of operating systems. What has gotten slower is disk io and network transfer time, and some tests, such as web serving, may be using all or mostly pages in memory, so this is not as obvious as it might be.

    In addition, the test was run in a virtual machine, so to some extent the huge host memory provided more resources, and the very fast disk hides poor choices in the io
