Linus Torvalds Hopes Intel's AVX-512 'Dies A Painful Death' (phoronix.com)
"Linux creator Linus Torvalds had some choice words today on Advanced Vector Extensions 512 (AVX-512) found on select Intel processors," reports Phoronix:
In a mailing list discussion stemming from the Phoronix article this week on the compiler instructions Intel is enabling for Alder Lake (and Sapphire Rapids), Linus Torvalds chimed in. The Alder Lake instructions being flipped on in GCC right now make no mention of AVX-512 but only AVX2 and others, likely due to Intel pursuing the subset supported by both the small and large cores in this new hybrid design being pursued.
The absence of AVX-512 for Alder Lake led Torvalds to comment:
I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.
I hope Intel gets back to basics: gets their process working again, and concentrate more on regular code that isn't HPC or some other pointless special case.
I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota.
Because absolutely nobody cares outside of benchmarks.
The same is largely true of AVX512 now - and in the future...
After several more paragraphs, Torvalds reaches his conclusion. "Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can."
Phoronix notes that Torvalds' comments came "just weeks after he switched to AMD Ryzen Threadripper for his primary development rig."
*gasp* Don't use those words! (Score:3, Insightful)
We at politically correct Linux hope that it will be peacefully euthanized!
Re:*gasp* Don't use those words! (Score:5, Informative)
No, he's saying he wants them to spend their time and energy improving the CPU for general purpose computing rather than on niche use cases that happen to look good in artificial benchmarks.
Intel seems to want to produce idiot-savant processors.
Re: (Score:2)
No, he's saying he wants them to spend their time and energy improving the CPU for general purpose computing rather than on niche use cases
That is not a reasonable expectation. They are different groups of people with very different skillsets.
Adding these new instructions is just a matter of connecting transistors.
Improving general performance means reducing the step size
You can't just take a bunch of Verilog coders and magically turn them into quantum physicists.
Re: (Score:2)
Perhaps they should connect some of those transistors to look like an extra GP register...
Re: (Score:2)
Perhaps they should connect some of those transistors to look like an extra GP register...
That would require a redesign of the instruction set.
There are only so many bits available in each instruction to indicate the register.
Re: (Score:2)
Not all of the registers in a CPU can be accessed directly through opcodes. See Register Renaming [wikipedia.org]
The extras are used to better allow out-of-order execution. x86_64 sort of emulates the architectural x86_64 register set on a related actual processor implemented in hardware.
Re: (Score:2)
I don't know about floating point, but there's an important bit in AVX512 that's useful in most use cases: 512-bit moves.
That's because 1 cacheline = 64 bytes = 512 bits. Any store of a lesser size to a line that isn't already cached triggers a read-for-ownership, starting the slow process of fetching from cache/memory, even though it's totally useless because you're going to overwrite that data in the very next instructions. But the processor has no way to know that: if you write a 64-bit word, it must read the whole cacheline to be able to preserve the other 56 bytes.
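For the curious, here is a minimal sketch of a full-line 512-bit store (my own illustration, not from the parent post); it assumes a 64-byte-aligned destination and a compiler with AVX-512 enabled (e.g. -mavx512f):

#include <immintrin.h>
#include <stdint.h>

/* Fill one 64-byte, cache-line-aligned buffer with a single 512-bit
   non-temporal store, so the line is written without being read first.
   Assumes dst is 64-byte aligned. */
static void fill_line_avx512(void *dst, uint8_t value)
{
    __m512i v = _mm512_set1_epi8((char)value);
    _mm512_stream_si512((__m512i *)dst, v);
    _mm_sfence();  /* order the non-temporal store before later accesses */
}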
Re: (Score:2)
Sure, but how often do you need to move/copy exactly 64 bytes cache aligned but don't want to use the existing GP instructions for blockwise moving/copying memory?
Re: (Score:2)
Sure, but how often do you need to move/copy exactly 64 bytes cache aligned but don't want to use the existing GP instructions for blockwise moving/copying memory?
If you ever read a file of more than 64 bytes or you ever write more than 64 bytes to the network at once then yes, block copies are useful.
Re: (Score:2)
That was a 2 part question. Why do you not want to use the existing general purpose instructions for that?
Meanwhile, when reading a file from disk or sending bytes on the network, the disk and network controllers do the memory transfer. Linux has done zero copy network transmission for a long time now.
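As a rough illustration of that zero-copy path (a sketch under the assumption of an already-open file descriptor and a connected socket, not code from the parent post), Linux's sendfile() moves the data entirely kernel-side:

#include <sys/sendfile.h>
#include <sys/types.h>

/* Send an entire already-open file to a connected socket: pages go from the
   page cache to the socket without being copied through a userspace buffer. */
static ssize_t send_whole_file(int sock_fd, int file_fd, size_t file_size)
{
    off_t offset = 0;
    return sendfile(sock_fd, file_fd, &offset, file_size);
}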
Re: (Score:2)
If you ever read a file of more than 64 bytes or you ever write more than 64 bytes to the network at once then yes, block copies are useful.
If you care enough about performance to cache-align your file reads and writes, why aren't you using memory-mapped I/O?
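For reference, a minimal memory-mapped I/O sketch (fd and len are assumptions: an open file and its size); the file's pages are mapped straight into the address space, so reads avoid explicit read() copies:

#include <stddef.h>
#include <sys/mman.h>

/* Map an already-open file read-only; returns a pointer to its contents,
   or NULL on failure. */
static void *map_file(int fd, size_t len)
{
    void *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    return (base == MAP_FAILED) ? NULL : base;
}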
Re: (Score:2)
Could you please educate me what "GP instructions" could be useful here?
As for exactly 64 bytes -- that's "at least 64 bytes". I know of only one case where storing a bigger block in one go could be an improvement (and on current hardware assuming only 1-2 threads).
And it's not only memcpy: ordinary stores could also be transparently optimized by the compiler. Usually, writing to a struct looks like:
p->foo = x;
p->bar = x*17+y;
p->baz = 0;
p->quux = rnd(16);
If the initialized part of the struct is a full, aligned cache line, those narrow stores could be combined into a single 512-bit store.
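A hand-written sketch of that idea (my own illustration; struct record, its field types, and the aligned(64) attribute are assumptions chosen to make the struct exactly one 64-byte cache line):

#include <immintrin.h>

struct __attribute__((aligned(64))) record {
    long foo, bar, baz, quux;   /* 4 x 8 bytes */
    long pad[4];                /* padding up to 64 bytes */
};

/* Stage the fields in a local copy, then write the whole cache line with one
   512-bit store instead of several narrow stores. */
static void init_record(struct record *p, long x, long y, long r)
{
    struct record tmp = { .foo = x, .bar = x * 17 + y, .baz = 0, .quux = r };
    _mm512_store_si512((__m512i *)p, _mm512_load_si512((const __m512i *)&tmp));
}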
Re: (Score:2)
I'm suspicious about the effectiveness of your optimization. Intel is surprisingly good at optimizing bus traffic, and I don't think they would miss optimizing consecutive writes to the same cache line. Multiple line write-back caches are pretty standard at this point, because it permits doing multiple sequential writes in minimum time.
Also, the way the DRAM works, I'm not sure you can actually avoid the read penalty when the memory line is selected. Thus, the bus traffic is likely a read-modify-write cycle anyway.
Re: (Score:2)
As Cassini2 said, REP MOV.
If they have extra transistors burning a hole in their pocket, they could try an instruction that invalidates a cache line's worth of memory by allocating the cache line without the read. It could even be built into REP MOV and invoked automatically whenever it would be valid. Then use masked writes for any non-well-formed writes at the beginning and end of the range.
That's the sort of thing Linus would like to see Intel putting resources into rather than AVX-512.
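For what it's worth, the "existing GP instruction" route already looks roughly like this (a sketch using x86-64 GCC inline assembly, not anything from the posts above); on CPUs advertising ERMSB, REP MOVSB is often competitive with hand-rolled vector copies:

#include <stddef.h>

/* Block copy via REP MOVSB: RDI = destination, RSI = source, RCX = count. */
static inline void rep_movsb_copy(void *dst, const void *src, size_t n)
{
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
}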
Re: (Score:2)
Re: (Score:2)
Plus stuff like SVE2 handles wide vector math much better than AVX512. AVX and AVX2 are both pretty good. AVX512 is a terrible mess of conflicting support and standards.
Re:*gasp* Don't use those words! (Score:5, Insightful)
Xeon is a server chip. If you double the integer performance, you will nearly double the real performance of the majority of things that run on a typical server.
Re: (Score:2)
That depends on what you're serving and how you serve it.
The modern "throw more hardware at it" approach tends to get less bang for the buck than the "tune it till it runs well" approach.
Re:*gasp* Don't use those words! (Score:5, Informative)
No, it doesn't.
All web traffic is purely integer-based.
All lossless compression is integer-based.
All database systems are purely integer-based.
Most encryption is integer-based.
All lossy video compression is purely integer-based.
You're only going to see heavy FPU usage for simulation and machine learning - those are literally corner-cases in the server world.
Re: (Score:2)
Media compression/decompression is often FP-based these days (but is also often hardware-accelerated on the GPU). Encryption/decryption is always integer-based; I find it hard to imagine any encryption that would be satisfied with the inexactness of FP (but it is often also hardware-accelerated on the CPU).
But besides media compression and decompression, none of those tasks are suitable for vectorization.
Re: (Score:2)
All lossy video compression is purely integer-based.
Wait, whaaat? Aren't these purely FP transforms?
Re:*gasp* Don't use those words! (Score:4, Insightful)
In theoretical papers, yes. In real-world implementations they are almost universally integer, with the rounding chosen to achieve the optimal result within a given word size.
JPEG blocks are 8x8 pixels and use up to 8 cosine transforms per block, resulting in the frequency-encoded image (before culling less important components) being the exact same data size as value-encoded, and encoding the source image losslessly but without redundancy. (then the less important frequencies are replaced by zeros and a standard lossless compression is applied; it compresses data with a lot of zeros pretty well.)
Re: (Score:2)
If the integer performance of your CPU doubled, I doubt you'd notice.
There are still people that compile stuff. And compilation is mostly about integer performance (and sometimes memory access).
Re: (Score:2)
Sure. But hopefully you're compiling stuff less than people are using the result. Even then, compiling is pretty parallel. You can just use a lot of cores.
Re: (Score:3)
If the integer performance of your CPU doubled, I doubt you'd notice.
Sure, if you're completely blind you indeed wouldn't notice.
Doubling floating point (which is what AVX-512 does)?
It does no such thing.
Re: (Score:3)
AVX512 does lots of different things. It's a mess. Read up on the ISA subsets sometime. Or better yet:
https://twitter.com/InstLatX64... [twitter.com]
Re: (Score:2)
Sure, it looks like a pretty bad attempt to implement some useful vector instructions. That doesn't excuse Linus saying this:
Intel did that once upon a time. It was called the 486SX.
Re: (Score:2)
Re: (Score:2)
Torvalds doesn't think 128b+ vector extensions - stuff like AVX or xOP - are useful in the user space. I don't know that I really agree with him on that, but he's entitled to his opinion. Intel's approach has been kind of janky. If they had introduced something like SVE2 instead, maybe Torvalds would come around a bit.
Re: (Score:2)
XOP was 128bit. It was based on SSE5 ideas Intel was considering before they went with 256bit AVX instead.
Re: (Score:2)
The SX was just a way to salvage poor yield on the DX. That is, the SX was basically a defective DX with the bad parts disabled and marketing saying "we meant to do that!".
Re: (Score:2)
>The problem is, Linus's definition of "what everyone cares about" seems to be "integer performance."
I care most about arbitrary precision floating point and arbitrary precision integer performance. A lot of the code I use deals with fantastically small or large values, because it is computing probabilities and security margins for cryptographic systems.
So I often use gmp and mpfr and similar libraries in C and Python. Python when I'm working out the algorithms, and then C when I need them to go fast and
Re: (Score:2)
You won't notice anything at all unless your code is rewritten specifically to use AVX-512, which is only going to be done for a tiny number of niche use cases until all those millions of processors out there which don't support AVX-512 get aged out.
On the other hand, improving the execution speed of the basic instructions that have been in the processors for years will improve the performance of the code you are already running.
Re:*gasp* Don't use those words! (Score:5, Informative)
No, AVX 512 is total crap. Reposting some good information from someone else way down below:
AVX-512 is useless because it downclocks the entire processor for absolute ages after running just one AVX-512 instruction. AVX-512 is only useful if ALL you are running is AVX-512 instructions, with no other parts of the same software or anything else on the entire machine needing performance.
AVX-512 is great for benchmarking, because for a benchmark you generally do not run anything else on the machine, and the benchmark does not do anything useful with the results it gets from the AVX-512 calculations.
Re:*gasp* Don't use those words! (Score:4, Insightful)
That downclocking is only there as an offset set in the microcode. For some reason, Intel has never tried (or never bothered) to introduce fine-grain downclocking on a per-core basis. Intel actually does the same thing for AVX2, but it isn't as noticeable since 256-bit vector math isn't as much of a power hog on . . . really any of the CPUs that support it vs. the first CPUs to support AVX512.
Yes, that is a nightmare scenario for VMs, since technically one VM user could slow down an entire physical CPU just running some AVX512 code.
All that aside though, if you look at how much extra heat is generated and power is drawn running 512b vector math, it gets pretty extreme, hence the need for downclocking. AMD hasn't introduced support for AVX512 yet, but they already support AVX2 which can bring some heat on its own, and AMD's approach is to run a boost algo that references current draw/heat and voltage relative to one another when determining target clockspeed. So it will downclock in AVX2 workloads like an Intel CPU, but only to the extent that the extra current draw mandates that lower voltages be chosen to stay within heat/power targets. Popping in one AVX2 instruction on one core won't blip the entire CPU.
And he's not wrong. (Score:2)
FP has always been Intel's weakness (Score:3, Interesting)
FP is so hard for Intel that at one point Apple's SANE 68k library (which ran on their 68k Macs) was faster and more accurate than Intel's hardware FP implementation.
Re: (Score:2)
But FP also used to be Intel's strength. In the 386DX through Pentium II or so, Intel processors were the cheapest way to get lots of FP. And all of those processors had fairly decent bus bandwidth (of both types.) But AMD clobbered them in that department for a while. As someone whose first PC was the first PC (a 5150) I have found the story of AMD vs. Intel to be one of the most compelling in the [computing] industry. /from my $300 Ryzen3 laptop
Re:FP has always been Intel's weakness (Score:4, Informative)
In the 386DX through Pentium II or so, Intel processors were the cheapest way to get lots of FP.
The 386DX had no FPU at all.
Prior to the 80486, you needed to buy a separate math coprocessor for hardware floating point support.
And with the 486s, only the DX line had the math coprocessor built in. The SXs did not.
Re: (Score:2)
Yeah, you're right, 486. I had 8088, [early] 286, 386DX, 486SX, 486DX... All of them were credible processors in their day, with plenty of power for what they cost. The 486SX was pretty gimpy but it was cheap, and was fine for business applications, if not for gaming.
Re: (Score:2)
The games at that time didn't use the FPU anyway. That changed with the Pentium and Quake.
Re: (Score:2)
Right, but the 486SX had a bus bandwidth problem. Back before game consoles were PCs, more bus bandwidth was one of their hallmarks, to improve fill rate for example.
Re:FP has always been Intel's weakness (Score:4, Informative)
I'm pretty sure the 486SX was just a DX with the FPU disabled (it didn't pass QA). A 487 was just a normal DX that took over from the 486SX. Either chip was a 32-bit chip with a full 32-bit bus. I don't recall any bandwidth issues with the 486SX.
The 386SX was the one with the bus bandwidth issue. It was a 32-bit processor with a 16-bit bus to save costs.
Re: (Score:2)
The fact that Intel, even back then, used the same marketing term (SX/DX) to mean two completely different CPU distinctions, well... we all know where we are today.
https://en.wikipedia.org/wiki/... [wikipedia.org]
Re: (Score:2)
The SX did, but it was disabled.
Re: (Score:2)
I think I remember that: Intel actually made the 486SX (which only came SMC soldered on the motherboards IIRC) and 486DX basically the same, but burned a fuse on the SX so that the math coprocessor wouldn't work. The DX "upgrade", supposedly just for the purpose of adding a math coprocessor, was in fact a whole other socketed 486DX CPU that hard-disabled the 486SX when inserted. That was the very first PC I ever had, 8-bit ISA bus and all, though I never did the DX upgrade. I ended up replacing the motherboard
Re: (Score:2)
Even my 286 had a 16-bit ISA bus...
Re: (Score:2)
You have to get an x87 coprocessor
Re: (Score:2)
Re: (Score:2)
IIRC, faster but more expensive.
Re: (Score:2)
Re: (Score:2)
Apart from what everyone else has said about the history, x87 was a nightmare for anyone who needed FP to be both performant and reliable.
I had one piece of code which resembled this:
float x = something();
if (x > 0.f) {
assert(x != 0.f);
do_something_with(x);
}
The assertion would regularly fail.
The reason why is that, depending on the compiler's whims, x could be kept in an 80-bit x87 register between the initial calculation and the test, but then be spilled to memory (and rounded down to a 32-bit float) before the assert, at which point a value that was barely greater than zero could become exactly 0.f.
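One common workaround (a sketch of the general technique, not necessarily what the original author did) is to force the value out of the extended-precision register so every comparison sees the same 32-bit rounding; compiling with -ffloat-store or -mfpmath=sse has a similar effect:

volatile float x = something();  /* volatile forces a 32-bit store/reload */
if (x > 0.f) {
    assert(x != 0.f);   /* now both reads observe the same rounded value */
    do_something_with(x);
}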
All the extensions are useless (Score:4, Insightful)
Re: (Score:3)
RISC for the win!
Re:All the extensions are useless (Score:4, Interesting)
We'll see soon enough, in less than six months actually.
Re: (Score:2)
Re: (Score:2)
RISC-V is all extensions...
Re: (Score:3)
Re: (Score:3)
That's because Intel didn't plan any of their vector ISA extensions around heterogenous core configurations. Tremont doesn't support AVX at all, while Sunny Cove supports AVX, AVX2, and some AVX512. They had to disable all ISAs not present on Tremont to prevent binaries from either crashing when trying to issue AVX instructions to a Tremont core or locking all threads to the Sunny Cove core to ensure ISA compliance.
Tiger Lake supports AVX512, but its successor (Alder Lake) will only support AVX2 thanks to the small cores not supporting AVX512.
Re: (Score:2)
Re: (Score:2)
Perfectly good older hardware is getting thrown away because they don't support SSE2.
There's nothing perfectly good about hardware so old it doesn't support SSE2, especially considering SSE2 is actually useful outside of benchmarking.
Re: (Score:2)
especially considering SSE2 is actually useful outside of benchmarking
Considering that it is a requirement for any AMD64 code, it indeed is actually useful.
Re: (Score:2)
A lot of modern software actually requires SSE4.1 to run properly. Pentium IIIs need not apply.
Re: (Score:2)
The newest x86 hardware that doesn't support SSE2 that was ever in common use was the Athlon XP and its related Semprons. So you're dealing with hardware that's at least 15 years old at this point and most likely pushing 20 years old.
A high-end Athlon XP system is powerful enough to run a modern Linux distro at an acceptable speed, assuming a distribution that still supports 32-bit. Windows left those systems behind some time ago: Windows 8 and later require the NX bit (which these processors lack).
Re:sse2 was introduced in 2000 bud (Score:5, Informative)
Re: (Score:2)
Geode is ancient.
Why, in 2020, compile for specific CPU variants? (Score:2)
Better to JIT on the actual machine. That way you are compatible across machines and can use whatever tricks they provide.
Moreover, at run time you have the whole code base, not just the current module, which can be inlined for great efficiency without programmer effort or huge header files.
Back in the 1970s your target machine might not have had a compiler, or the grunt to run it. But that was long ago. So why not use the modern approach that has now been available for over a decade?
Ah! C/++.
I'm ambivalent (Score:2)
Sure, these are niche features. But isn't that mostly because our development tools are so high-level and abstracted that no one can take advantage of them?
I get that these instructions are probably of minimal use to kernel devs. But they COULD be useful to all sorts of projects if they were written in assembly or were somehow integrated into higher-level stacks.
Why is it - for example - that Array.map in most languages results in iterated code instead of vector code?
Intel can't force language devs to build these
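For context, compilers can already turn the simplest map-style loops into vector code on their own; here is a minimal sketch (my own example, compiled with something like gcc -O3 -march=skylake-avx512, where the vector width actually chosen depends on the compiler's tuning):

#include <stddef.h>

/* A map-style loop the auto-vectorizer handles well: the restrict qualifiers
   tell the compiler the arrays don't overlap, so it can emit wide vector adds. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}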
Re: (Score:3)
You seem to have missed the point of the post. The point is that Intel is neglecting what's really important (security) in favor of adding frills to make the process look good. In short, the CPU cores are bad at their primary job and these neat features are superfluous.
Re: I'm ambivalent (Score:2)
I get what you're saying, but I'm skeptical that the teams breaking shit with speculative execution et al are the same teams as those adding vector instructions. I'm not convinced abandoning new features will make the old features more stable.
Re: (Score:2)
Speculative execution exploits like Meltdown had nothing to do with AVX, AVX2, or AVX512.
Re: (Score:3)
It's not like Intel has enough FPU execution units that they can complete 16 single-precision multiplications or additions simultaneously.
The efficiency advantage of going from single instruction, single data to single instruction, multiple data exists, but it falls off sharply once the processor is resource-limited.
Re:I'm ambivalent (Score:5, Informative)
Don't forget the downsides. In exchange for a possible improvement in a single vector operation, you have to spill ALL the registers. If you happen to need to do array.map on 8 element arrays (and nothing else), the new instructions are a big win. Otherwise, you'll burn up most of the speed improvement saving and loading registers. Looks good on a benchmark, not so much in a practical application.
Re: (Score:2)
If you use a JVM language, OpenJDK's JVM does a really good job of autovectorizing with AVX, AVX2, AVX512, and even NEON targets. Pretty sure they'll be ready to go with SVE2 once it becomes ubiquitous to modern ARM designs.
Re: (Score:2)
Heresy (Score:2)
Repeat often: C is faster than Java because it precompiles everything for a generic target architecture.
Re: (Score:2)
. . . except that it doesn't.
Re: (Score:2)
Here was one of the materials I saw on it years ago:
https://docs.huihoo.com/javaon... [huihoo.com]
If you actually try using it on relatively-simple loops and know how to manipulate the JVM via heuristics, it can work nicely.
Re: (Score:2)
The point of these instructions is to utilise *all* the compute units of the core in parallel. Since there are no resources left over for your "normal" code to use between these instructions, the CPU can't pipeline any other work. It essentially turns these general-purpose CPU cores into something closer to a GPU core.
Translating any control-flow-heavy method into something that makes use of vector instructions is a *hard* problem. Take simdjson as an example. There are plenty of servers on the internet whose pri
Re: (Score:2)
Building new vector instructions does have value, but not when you treat it like the clusterfuck that Intel has made of AVX-512.
They started out as pure GPU processing units (on Larrabee), and Intel didn't even maintain source compatibility between their Phi vector processors and the Core processors until Ice Lake.
That's TEN YEARS of clumsy mismanagement from Intel, so of course adoption is mostly non-existent (outside paid Intel updates to a few compute libraries and software).
Adoption would be a lot more co
AVX is no longer special case (Score:2)
Machine learning code benefits *hugely* from AVX. Machine learning is no longer special-case garbage.
Re: (Score:2)
Machine learning hugely benefits from GPUs.
Intel should not try to compete with a different field.
Re: (Score:2)
This is because GPUs have many simple processors executing in parallel. AVX provides the same thing.
AVX-512 supports fused multiply-add instructions for eight double-precision or sixteen single-precision floating-point numbers, and packed operations on eight 64-bit or sixteen 32-bit integers, within the 512-bit vectors of each core in one clock cycle. Think y = mx + b for linear and logistic regression.
I have a server with 56 cores and it processes 896 matrix rows at a time.
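A rough illustration of that fused multiply-add (my own sketch, not the parent poster's code; all pointers are assumed to reference at least 16 floats):

#include <immintrin.h>

/* One AVX-512 FMA computes y = m*x + b for 16 single-precision lanes at once. */
void fma16(float *y, const float *m, const float *x, const float *b)
{
    __m512 vm = _mm512_loadu_ps(m);
    __m512 vx = _mm512_loadu_ps(x);
    __m512 vb = _mm512_loadu_ps(b);
    _mm512_storeu_ps(y, _mm512_fmadd_ps(vm, vx, vb));
}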
Re: (Score:2)
Facebook loves bfloat16
Re: (Score:2)
Machine learning code benefits *hugely* from AVX. Machine learning is no longer special-case garbage.
Only the clinically insane attempt to do machine learning applications on a CPU.
If a computer is sex then machine learning is going to a BDSM club and doing it on a CPU is going to a BDSM club only so a dominatrix can kick you in the balls once and send you home. Can you find someone who's into that? Yep. Are they screwed in the head? Yep. Is literally anything else better? Yes.
Re: (Score:2)
Re: (Score:3)
But some other program on the same machine might use it. Maybe some program where performance doesn't matter.
Then my program suffers from the AVX-512 induced downclocking.
Re: (Score:3)
Re: (Score:2)
Admit it, SVE2 is better than AVX anything.
They are useful (Score:3)
They help when 1% of your code is spending 10% of your cycles. But I do not know any compilers that will "automagically" make very good use of them. They are essentially a separate CPU inside the CPU, and have a "foreign" instruction set in order to access them.
So, yes, if you have profiled your code in production, and found a very sweet spot that can benefit from parallel instructions, they would be extremely helpful. This is true for media players, some network services, and to a point AI.
On the other hand, this could have been handled better. The current design has lots of drawbacks.
He is not the only one (Score:2)
EditorDavid, do you even English? (Score:3)
A reference to "pursuit" in such a context is quite a metaphor to use even once. Using it twice in the same sentence physically injures the well-read...
Gaaa!!!!
I get his point, but.... (Score:2)
32 registers of 512 bits each, sitting in the middle of the processor waiting to be specially handled, are kind of a tricky thing for general-purpose computing.
OTOH I would imagine that some of the major consumers, especially numerical calculations (FEM, CFD), are happy to have an option where matrix multiplication/inversion etc. is 10-20% faster, and one should not neglect that one of the main uses of Xeons is to be in the "approved supported workstation configurations" of major engineering packages. (yes, that's what th
He has said as much (Score:2)
> apart from crypto maybe but Linus is an idiot about that and security in general; sorry but it's true
It's okay, Linus is aware he doesn't know cryptography - he has said as much. It's good that he's aware he doesn't know. Over-estimating one's security knowledge is dangerous.
What he does need some knowledge for is understanding trade-offs between security and other goals, such as performance and ease of use.
Re:that's your job Linus (Score:5, Interesting)
AVX-512 is useless because it downclocks the entire package for absolute ages after running just one AVX-512 instruction. AVX-512 is only useful if ALL you are running is AVX-512 instructions, with no other parts of the same software or anything else on the entire machine needing performance.
AVX-512 is great for benchmarking, in that for a benchmark you generally do not run anything else on the machine, and the benchmark does not do anything useful with the results it gets from the AVX-512 calculations.
Re: (Score:2, Redundant)
this might mean something to me if you said what "absolute ages" actually means. i'm sure it's useless 99.9%+ of the time, but there may be specialized applications for it (like ML applications; simd is going to be important). no, i don't think intel is wise to bet the farm on this half-baked chimera but otoh it's none of my business and, frankly, i don't care about Linus Torvalds's opinion on anything but the linux kernel either.
Re: (Score:2)
An AC said the same thing like 4 minutes after you posted this, and I'll tell you the same thing I told the AC: AVX512 isn't the problem. It's the way Intel chooses to balance heat and power loads while doing wide vector math. Intel does the same thing with AVX2, but the downclocking is less severe since the extra load on the CPU in 256b vector math isn't as intense. AMD handles AVX2 better by choosing a boost algo that adjusts voltage based on current draw and heat, and then picks a clockspeed that is known to be stable at that load.
Re: (Score:3)
AVX-512 is useless because it downclocks the entire package for absolute ages after running just one AVX-512 instruction.
No, it does not.
AVX/2 on Haswell cores did in fact do this, but like most good Intel bashing, you're bitching about long-since-fixed problems.
AVX-512 is great for benchmarking, in that for benchmark you generally do not run anything else on the machine when benchmarking, and the benchmark does not do anything useful with the results it gets from the AVX-512 calculations.
Complete horse-shit.
AVX-512 is useful for *many* things. AVX instructions are used by all kinds of things that are improved by vector maths.
It's a shame this nonsense was modded up.
Re: (Score:2)
catering to the people who program computers is stupid; it's their job to face these difficulties
I thought we generally tried to make the job of programming computers as easy as possible, since it's already tough enough as it is. Making it extra difficult for no good reason warrants a severe brain damage diagnosis for those thinking it's a good idea to make it extra difficult.
Re: (Score:2)
The kernel can't use non-general-purpose (e.g. SSE/AVX) registers internally, otherwise they would need to be saved and restored every time you enter the kernel.
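For anyone wondering why that's expensive, in-kernel vector code on x86 has to be bracketed roughly like this (a sketch of the Linux kernel convention, not code from the parent post) so the user task's vector state gets saved and restored around it:

#include <asm/fpu/api.h>

static void do_vector_work(void)
{
    kernel_fpu_begin();
    /* ... SSE/AVX instructions may be used here; preemption is disabled ... */
    kernel_fpu_end();
}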