
Linus Torvalds Hopes Intel's AVX-512 'Dies A Painful Death' (phoronix.com)

"Linux creator Linus Torvalds had some choice words today on Advanced Vector Extensions 512 (AVX-512) found on select Intel processors," reports Phoronix: In a mailing list discussion stemming from the Phoronix article this week on the compiler instructions Intel is enabling for Alder Lake (and Sapphire Rapids), Linus Torvalds chimed in. The Alder Lake instructions being flipped on in GCC right now make no mention of AVX-512 but only AVX2 and others, likely due to Intel pursuing the subset supported by both the small and large cores in this new hybrid design being pursued.

The lack of seeing AVX512 for Alder Lake led Torvalds to comment:

I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.

I hope Intel gets back to basics: gets their process working again, and concentrates more on regular code that isn't HPC or some other pointless special case.

I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it mattered not one iota.

Because absolutely nobody cares outside of benchmarks.

The same is largely true of AVX512 now - and in the future...

After several more paragraphs, Torvalds reaches his conclusion. "Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can."

Phoronix notes that Torvalds' comments came "just weeks after he switched to AMD Ryzen Threadripper for his primary development rig."
  • by Opportunist ( 166417 ) on Sunday July 12, 2020 @12:39PM (#60290184)

    We at politically correct Linux hope that it will be peacefully euthanized!

  • These abominations need to die; SVE{,2} and RV{32,64}V >> AVX{,2,-512}
  • by mveloso ( 325617 ) on Sunday July 12, 2020 @12:48PM (#60290212)

    FP is so hard for Intel that at one point Apple's SANE 68k library (which ran on their 68k Macs) was faster and more accurate than Intel's hardware FP implementation.

    • But FP also used to be Intel's strength. In the 386DX through Pentium II or so, Intel processors were the cheapest way to get lots of FP. And all of those processors had fairly decent bus bandwidth (of both types). But AMD clobbered them in that department for a while. As someone whose first PC was the first PC (a 5150), I have found the story of AMD vs. Intel to be one of the most compelling in the [computing] industry. /from my $300 Ryzen3 laptop

      • by Rockoon ( 1252108 ) on Sunday July 12, 2020 @01:11PM (#60290306)

        In the 386DX through Pentium II or so, Intel processors were the cheapest way to get lots of FP.

        The 386DX had no FPU at all.

        Prior to the 80486, you needed to buy a separate math coprocessor for hardware floating point support.

        And with the 486s, only the DX line had the math coprocessor built in. The SXs did not.

        • Yeah, you're right, 486. I had 8088, [early] 286, 386DX, 486SX, 486DX... All of them were credible processors in their day, with plenty of power for what they cost. The 486SX was pretty gimpy but it was cheap, and was fine for business applications, if not for gaming.

          • by Megol ( 3135005 )

            The games at that time didn't use the FPU anyway. Changed with the Pentium and Quake.

            • Right, but the 486SX had a bus bandwidth problem. Back before game consoles were PCs, more bus bandwidth was one of their hallmarks, to improve fill rate for example.

              • by caseih ( 160668 ) on Sunday July 12, 2020 @11:47PM (#60292160)

                I'm pretty sure the 486SX was just a DX with the FPU disabled (it didn't pass QA). A 487 was just a normal DX that took over from the 486SX. Either chip was a 32-bit chip with a full 32-bit bus. I don't recall any bandwidth issues with the 486SX.

                The 386SX was the one with the bus bandwidth issue. It was a 32-bit processor with a 16-bit bus to save costs.

        • The fact that Intel, even back then, used the same marketing term to describe two completely different CPU distinctions... well, we all know where we are today.
          https://en.wikipedia.org/wiki/... [wikipedia.org]

        • The SX did, but it was disabled.

          • I think I remember that: Intel actually made the 486SX (which IIRC only came SMC-soldered onto motherboards) and the 486DX basically the same, but burned a fuse on the SX so that the math coprocessor wouldn't work. The DX "upgrade", supposedly just for the purpose of adding a math coprocessor, was in fact a whole other socketed 486DX CPU that hard-disabled the 486SX when inserted. That was the very first PC I ever had, 8-bit ISA bus and all, though I never did the DX upgrade. I ended up replacing the motherboar

        • You had to get an x87 coprocessor.

      • Wasn't Weitek a significantly better option than Intel's coprocessor for the 386?
      • Apart from what everyone else has said about the history, x87 was a nightmare for anyone who needed FP to be both performant and reliable.

        I had one piece of code which resembled this:

        float x = something();
        if (x > 0.f) {
            assert(x != 0.f);
            do_something_with(x);
        }

        The assertion would regularly fail.

        The reason is that, depending on the compiler's whims, x could be kept in an 80-bit x87 register between the initial calculation and the test, but then be spilled to memory as a 32-bit float before the assertion. The spill drops the excess precision, so a value that compared greater than zero in the register could compare equal to zero afterwards.
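
        A minimal self-contained sketch of the classic workaround (assuming GCC-style x87 code generation; something() and do_something_with() are made-up stand-ins): forcing the value through memory makes the test and the assertion agree on the same 32-bit float.

        #include <assert.h>
        #include <stdio.h>

        /* hypothetical stand-ins for the original code */
        static float something(void) { return 0.1f + 0.2f; }
        static void do_something_with(float x) { printf("%g\n", x); }

        int main(void) {
            /* volatile forces x through memory, truncating any 80-bit
               x87 register value to a real 32-bit float, so both
               comparisons below see the same value */
            volatile float x = something();
            if (x > 0.f) {
                assert(x != 0.f);
                do_something_with(x);
            }
            return 0;
        }

        (GCC's -ffloat-store, or routing FP through SSE with -mfpmath=sse, gets the same consistency without touching the code.)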

  • by xack ( 5304745 ) on Sunday July 12, 2020 @12:53PM (#60290238)
    Perfectly good older hardware is getting thrown away because it doesn't support SSE2. Soon AVX-less hardware will be as well. Just improve instructions per second instead of adding pointless instructions.
    • RISC for the win!

    • Intel's Lakefield doesn't even have AVX, and it just came out.
      • That's because Intel didn't plan any of their vector ISA extensions around heterogeneous core configurations. Tremont doesn't support AVX at all, while Sunny Cove supports AVX, AVX2, and some AVX512. They had to disable all ISA extensions not present on Tremont, to prevent binaries from either crashing when trying to issue AVX instructions to a Tremont core, or locking all threads to the Sunny Cove core to ensure ISA compliance.

        Tiger Lake supports AVX512, but its successor (Alder Lake) will only support AVX2, thanks to its hybrid mix of big and small cores.

        • I would have thought that catching the opcode error and migrating the thread to a core capable of executing it might have worked, but maybe it wasn't worth the hassle to them.
    • Perfectly good older hardware is getting thrown away because they don't support SSE2.

      There's nothing perfectly good about hardware so old it doesn't support SSE2, especially considering SSE2 is actually useful outside of benchmarking.

      • especially considering SSE2 is actually useful outside of benchmarking

        Considering that it is a requirement for any AMD64 code, it indeed is actually useful.

      • A lot of modern software actually requires SSE4.1 to run properly. Pentium IIIs need not apply.

    • The newest x86 hardware that doesn't support SSE2 that was ever in common use was the Athlon XP and their related Semprons. So you're dealing with hardware that's at least 15 years old at this point and is most likely pushing 20 years old.

      A high end Athlon XP system is powerful enough to run a modern Linux distro at an acceptable speed, assuming a distribution that still supports 32-bit. Windows left those systems behind some time ago: Windows 8 and later requires the NX bit (which these processors lack).

  • Sure, these are niche features. But isn't that mostly because our development tools are so high-level and abstracted that no one can take advantage of them?

    I get that these instructions are probably of minimal use to kernel devs. But they COULD be useful to all sorts of projects if they were written in assembly or were somehow integrated into higher-level stacks.

    Why is it - for example - that Array.map in most languages results in iterated code instead of vector code? (See the sketch at the end of this comment.)

    Intel can't force language devs to build these
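
    For instance, the C equivalent of a pure element-wise Array.map is exactly the kind of loop compilers can already vectorize; a hypothetical sketch (map_scale is a made-up name):

    #include <stddef.h>

    /* independent iterations with no side effects: a compiler is free
       to turn this into SIMD code instead of scalar iteration */
    void map_scale(float *dst, const float *src, float k, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    Compile it with gcc -O3 -march=skylake-avx512 -S and the loop body comes out as SIMD instructions rather than one float per iteration.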

    • You seem to have missed the point of the post. The point is that Intel is neglecting what's really important (security) in favor of adding frills to make the processor look good. In short, the CPU cores are bad at their primary job and these neat features are superfluous.

      • I get what you're saying, but I'm skeptical that the teams breaking shit with speculative execution et al are the same teams as those adding vector instructions. I'm not convinced abandoning new features will make the old features more stable.

    • I think Linus is more likely considering the lackluster performance advantages of AVX-512 vs. SSE, and the rather large chunk of demand for such processing already being met by GPUs, which can do SIMD much, much better.

      It's not like Intel has enough FPU execution units to complete 16 single-precision multiplications or additions simultaneously.

      Efficiency advantages in going from single-instruction-single-data to single-instruction-multiple-data exist, but they fall off sharply once the processor is reso
    • Re:I'm ambivalent (Score:5, Informative)

      by sjames ( 1099 ) on Sunday July 12, 2020 @02:30PM (#60290516) Homepage Journal

      Don't forget the downsides. In exchange for a possible improvement in a single vector operation, you have to spill ALL the registers. If you happen to need to do array.map on 8 element arrays (and nothing else), the new instructions are a big win. Otherwise, you'll burn up most of the speed improvement saving and loading registers. Looks good on a benchmark, not so much in a practical application.
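
      The arithmetic on that register state is easy to sketch (a made-up illustration; the sizes are just the architectural register widths):

      #include <stdio.h>

      int main(void) {
          int zmm  = 32 * 64;  /* 32 ZMM registers x 64 bytes = 2048 bytes */
          int mask =  8 *  8;  /*  8 opmask registers x 8 bytes = 64 bytes */
          int xmm  = 16 * 16;  /* 16 XMM registers (SSE), for comparison   */
          printf("AVX-512 state: %d bytes\n", zmm + mask);  /* 2112 */
          printf("SSE state:     %d bytes\n", xmm);         /* 256  */
          return 0;
      }

      Roughly 2 KB of state per thread for AVX-512, versus 256 bytes for plain SSE: that's what gets saved and restored around the "big win".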

    • If you use a JVM language, OpenJDK's JVM does a really good job of autovectorizing for AVX, AVX2, AVX512, and even NEON targets. Pretty sure they'll be ready to go with SVE2 once it becomes ubiquitous in modern ARM designs.

      • Are there any materials on modern JVMs autovectorizing to AVX instructions? Back in the SSE days I only found out that HotSpot used scalar instructions for FP math, and that was it. If things have changed since then, I missed it (no doubt because I don't really follow Java developments, like, at all).
    • The point of these instructions is to utilise *all* the compute units of the core in parallel. Since there are no resources left over for your "normal" code to use between these instructions, the CPU can't pipeline any other work. It essentially turns these general-purpose CPU cores into something closer to a GPU core.

      Translating ordinary control flow into something that makes use of vector instructions is a *hard* problem. Take simdjson as an example. There are plenty of servers on the internet whose primary workload is parsing JSON.

      • Building new vector instructions does have value, but not when you treat it like the clusterfuck that Intel has made of AVX-512.

        They started out as pure GPU processing units (on Larrabee), and Intel didn't even maintain source compatibility between the Xeon Phi vector processors and the Core processors until Ice Lake.

        That's TEN YEARS of clumsy mismanagement from Intel, so of course the adoption is mostly non-existent (outside paid Intel updates to a few compute libraries and software).

        Adoption would be a lot more co

  • >Stop with the special-case garbage
    Machine learning code benefits *hugely* from AVX. Machine learning is no longer special-case garbage.
    • by allo ( 1728082 )

      Machine learning hugely benefits from GPUs.
      Intel shouldn't try to compete with a whole different class of hardware.

      • >Machine learning hugely benefits from GPUs.
        This is because GPUs have many simple processors executing in parallel. AVX provides the same thing.

        AVX-512 instructions support fused multiply-add on eight double-precision or sixteen single-precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers, within the 512-bit vectors of each core in one clock cycle. Think y = mx + b for linear and logistic regression.

        I have a server with 56 cores and it processes 896 matrix rows at a time.
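
        That y = mx + b loop looks roughly like this with AVX-512 intrinsics (a minimal sketch; axpb is a made-up name, and it assumes a compiler and CPU with AVX-512F, e.g. gcc -O2 -mavx512f):

        #include <immintrin.h>
        #include <stddef.h>

        void axpb(float *y, const float *x, float m, float b, size_t n) {
            __m512 vm = _mm512_set1_ps(m);   /* broadcast slope m     */
            __m512 vb = _mm512_set1_ps(b);   /* broadcast intercept b */
            size_t i = 0;
            for (; i + 16 <= n; i += 16) {   /* 16 floats per step */
                __m512 vx = _mm512_loadu_ps(&x[i]);
                /* one fused multiply-add: y = m*x + b */
                _mm512_storeu_ps(&y[i], _mm512_fmadd_ps(vm, vx, vb));
            }
            for (; i < n; ++i)               /* scalar tail */
                y[i] = m * x[i] + b;
        }
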
    • Machine learning code benefits *hugely* from AVX. Machine learning is no longer special-case garbage.

      Only the clinically insane attempt to do machine learning applications on a CPU.

      If a computer is sex then machine learning is going to a BDSM club and doing it on a CPU is going to a BDSM club only so a dominatrix can kick you in the balls once and send you home. Can you find someone who's into that? Yep. Are they screwed in the head? Yep. Is literally anything else better? Yes.

  • Comment removed based on user account deletion
    • by spth ( 5126797 )

      But some other program on the same machine might use it. Maybe some program where performance doesn't matter.

      Then my program suffers from the AVX-512-induced downclocking.

    • The only way to perfect AVX-512 would be to drop it and adopt something that wasn't invented by a band of baboons, such as a proper vector extension.
    • Admit it, SVE2 is better than AVX anything.

  • by stikves ( 127823 ) on Sunday July 12, 2020 @03:09PM (#60290638) Homepage

    They help when 1% of your code is spending 10% of your cycles. But I do not know of any compilers that will "automagically" make very good use of them. They are essentially a separate CPU inside the CPU, accessed through what amounts to a "foreign" instruction set.

    So, yes, if you have profiled your code in production and found a very sweet spot that can benefit from parallel instructions, they can be extremely helpful. This is true for media players, some network services, and, to a point, AI.

    On the other hand, this could have been handled better. The current design has lots of drawbacks.

  • I've said this often over the past few years: RISC-V's V extension has a much better vector design: https://youtu.be/9e9LCYt3hoc [youtu.be]
  • likely due to Intel pursuing the subset supported by both the small and large cores in this new hybrid design being pursued

    A reference to "pursuit" in such a context is quite the metaphor used once; using it twice in the same sentence physically injures the well-read...

    The lack of seeing AVX512 for Alder Lake

    Gaaa!!!!

  • Thirty-two 512-bit registers sitting in the middle of the processor, waiting to be specially handled, are a tricky thing for general-purpose computing.

    OTOH I would imagine that some of the major consumers, especially numerical workloads (FEM, CFD), are happy to have an option where matrix multiplication/inversion etc. is 10-20% faster, and one should not neglect that one of the main uses of Xeons is to sit in the "approved supported workstation configurations" of major engineering packages. (yes, that's what th
