Red Hat Engineer Improves Math Performance of Glibc
jones_supa writes: Siddhesh Poyarekar from Red Hat has taken a professional look into the mathematical functions found in glibc (the GNU C library). He has been able to provide an eight-fold performance improvement to the slowest path of the pow() function. Other transcendentals got similar improvements, since the fixes were mostly in the generic multiple-precision code. These improvements have already gone into glibc 2.18 upstream. Siddhesh believes that a lot of the low-hanging fruit has now been picked, but that this is definitely not the end of the road for improvements in multiple-precision performance. There are other, more complicated improvements, such as limiting the worst-case precision of the exp() and log() functions based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision (PDF); one would need to prove that those results apply to the glibc multiple-precision bits.
C versus Assembly Language (Score:2, Interesting)
I don't know much ASM, but a friend whose professional career is in programming told me that to really speed things up, sometimes you have to go the ASM way.
Is that true?
Re:C versus Assembly Language (Score:4, Informative)
It may, but it's pretty rare that it's worth it, and it also increases the cost of maintenance. A function in glibc, though, might be an exception.
Re: (Score:2)
It may, but it's pretty rare that it's worth it, and it also increases the cost of maintenance. A function in glibc, though, might be an exception.
There's nothing rare about it. SIMD vectorization is useful in tons of applications.
Re: (Score:2)
Really nothing rare about it? Maybe in your line of work. But since we're just talking general programming here, it's quite rare.
Re: (Score:2)
We weren't talking "general programming" here.
Re:C versus Assembly Language (Score:5, Informative)
Indeed it is, but it's still rare that you have to go to ASM even in those cases. In simple cases the compiler already generates SIMD code where it can help, and for almost all other cases there are C intrinsics.
Re:C versus Assembly Language (Score:4, Informative)
It may, but it's pretty rare that it's worth it, and it also increases the cost of maintenance. A function in glibc, though, might be an exception.
There's nothing rare about it. SIMD vectorization is useful in tons of applications.
Yes, and modern compilers are quite good at generating code that takes advantage of extended instruction sets.
Re: (Score:2)
There's nothing rare about it. SIMD vectorization is useful in tons of applications.
That's got little to do with ASM: GCC and others offer compiler intrinsics so you can access the vector instructions from C or C++.
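For instance, a minimal sketch of what that looks like in practice (SSE via GCC/Clang intrinsics; the array length is assumed to be a multiple of 4):

    #include <immintrin.h>

    /* Add two float arrays four lanes at a time using SSE intrinsics from
       plain C, no assembly involved. n is assumed to be a multiple of 4. */
    void add_f32(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);   /* unaligned 128-bit load */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }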
Re: (Score:2)
There's nothing rare about it. SIMD vectorization is useful in tons of applications.
Vectorization is not the same as assembler. You can do vectorization just fine in a high-level language. Actually, SSE3 is an absolute pain in the ass to use from assembler, and trying to do so is misguided.
Re: (Score:2, Insightful)
Compilers can't really compete with experts on performance, given enough time. If you're unfortunate enough to look into GCC itself, you can find assembly here and there in performance-critical portions of the code. The major downsides: you have to write it for each architecture you want to support (GCC supports many archs, so this is a really big deal), fewer people can read/write/fix it, bugs slip by more easily, and it takes longer to write.
Re:C versus Assembly Language (Score:5, Insightful)
Not just every architecture. In general, you may need to write it for every major revision of every architecture. As CPU pipelines and instruction sets change, the hand-crafted assembler may no longer be optimal.
(Exercise: Write an optimal memcpy/memmove.)
Re:C versus Assembly Language (Score:5, Informative)
Not just every architecture. In general, you may need to write it for every major revision of every architecture. As CPU pipelines and instruction sets change, the hand-crafted assembler may no longer be optimal.
(Exercise: Write an optimal memcpy/memmove.)
I have some math code that I optimized in assembly for Pentium Pro (686) back in the day. The performance improvements vs C are getting smaller and smaller but they are still positive. At least through Core Duo, which was the last time I had access to that code. Whenever we upgraded compilers, or we upgraded our development systems, I would test my old assembly code against the reference C code.
Regarding a case like your memcpy example: an assembly version may still be warranted. A piece of software may need to optimize for the low-end computers out there. So if the assembly is a win for the low end and neutral or a slight loss for the high end, it may still be the way to go. The low end is where the attention is needed to expand the pool of computers that meet the minimum system requirements; think video games. You have to optimize for the three-year-old computer: it's the one having performance problems, not the new computer. And if it does matter on the high end, it's simple enough to have earlier generations of an architecture use the assembly and later generations use the reference C code. Fear of future systems is no reason to leave current systems hobbled.
Re: (Score:2)
Regarding a case like your memcpy example: an assembly version may still be warranted. A piece of software may need to optimize for the low-end computers out there.
MacOS X has memcpy built into the operating system. At boot time, the OS determines the processor and copies a memcpy (and memmove, and memset) function optimised for that processor to a fixed memory location. If you copy megabytes, you will see it doing _very_ interesting stuff with cache prefetches and so on. C++ doesn't use these, unfortunately.
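glibc and GCC expose a similar mechanism to user code via the ifunc attribute: a resolver runs once at load time and picks an implementation for the current CPU. A minimal sketch under made-up function names (the real glibc versions are far more elaborate):

    #include <stddef.h>

    static void *my_memcpy_portable(void *d, const void *s, size_t n)
    {
        char *dp = d;
        const char *sp = s;
        while (n--)
            *dp++ = *sp++;      /* simple byte copy */
        return d;
    }

    static void *my_memcpy_avx(void *d, const void *s, size_t n)
    {
        /* an AVX-tuned copy would live here; reuse the byte copy in this sketch */
        return my_memcpy_portable(d, s, n);
    }

    /* The resolver runs once, when the binary is loaded. */
    static void *(*resolve_memcpy(void))(void *, const void *, size_t)
    {
        __builtin_cpu_init();   /* GCC's CPU-detection helpers */
        return __builtin_cpu_supports("avx") ? my_memcpy_avx
                                             : my_memcpy_portable;
    }

    void *my_memcpy(void *d, const void *s, size_t n)
        __attribute__((ifunc("resolve_memcpy")));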
Re: (Score:3)
Compilers can't really compete with experts on performance, given enough time.
This.
The thing is that the vast majority of us aren't assembly experts.
Compilers kick my ass except for the rarest of highly specific cases (all of them SIMD).
Re:C versus Assembly Language (Score:5, Insightful)
99.9% of the time, no.
The purpose of the compiler is to identify and optimize the code structures in higher-level languages. Many, many tools and generations of compilers have been dedicated to just that. For the vast majority of cases, the compiler will do a better job and leave you with the much easier task of maintaining a high-level codebase.
That said, there are specific operations, most frequently mathematical in nature, that are so explicitly well defined and unchanging that writing them in ASM may allow the author to take procedural liberties that the compiler doesn't know about or can't prove safe.
The end result of such code is typically virtually unreadable. The syntax masks the math, and the math obfuscates the syntax. But the outcome is a thing of pure beauty.
-Rick
Re: (Score:2)
99.9% of the time, no.
Said by someone who doesn't do anything involving multimedia or DSP. Compilers are horrendous at vectorization. It's why ffmpeg, x264, libjpeg-turbo, and pretty much any video/audio/image codec with decent performance has SIMD assembly for every platform that supports it. Otherwise they would be more than an order of magnitude slower.
If you don't believe me, compile x264 without the ASM optimizations, even at your compiler's highest optimization level, and watch how it slows to a crawl.
Re: (Score:2)
Not to mention Intel Performance Primitives, which is basically a library of ASM blobs tailored for various combinations of SSE/AVX, even for cache and pipeline differences, I believe.
Re: C versus Assembly Language (Score:2)
Shouldn't it be possible to automate the optimization based on architecture constants: number of registers, execution width, matrix dimensions, and depth of operation? I feel like if someone with ML experience profiled it, they'd have no trouble reducing the dimensionality and coming up with some tunables.
Re: (Score:2)
The thing is, + will be compiled into something useful on all platforms the compiler targets. Intrinsics will only become something useful on the platforms they are for. That's why they're closer to assembler than + is.
Don't need to be an expert to beat compilers ... (Score:4, Insightful)
The real problem is you need to be expert in the target processor(s) ...
Not really. Being an expert in assembly language in general may be required but not necessarily an expert on the target architecture. Transitioning from one target architecture to another is not like starting over from scratch. Part of becoming a good assembly language programmer is recognizing things in algorithms that can't quite be stated in a high level language, information or suggestions that can't be given to the compiler. Specific computer architectures provide the toolboxes to address such shortcomings. So a bit of the work is common regardless of the architecture and the rest is determining how to accomplish something with the architecture specific toolbox.
Circa 2000 I was beating PowerPC compilers on my first attempt. Now, I had a lot of experience with x86 and some with 68K, 8051, 8048, 6502 and Z80. At university I had taken undergrad and grad computer architecture classes, the latter focused on the Alpha. Before writing the "commercial" PowerPC code referred to earlier, I spent a couple of weeks reading PowerPC manuals and writing little pieces of test code.
Being new to PowerPC, I was concerned that I had missed something or done something wrong despite seriously beating the compiler. I was able to go to Apple and spend a couple of days with their engineers. Their PowerPC people thought the assembly code was fine and couldn't really improve upon it; their compiler people couldn't improve the original C code.
That said, assembly language is unnecessary much of the time. Contrary to popular belief, C code can be written in a manner that favors one architecture over another. By understanding the given architecture and its assembly language, I've been able to rewrite C code and not have to go all the way to assembly: having two C implementations of some code, one for x86 and one for PowerPC, or more generally one for CISC and one for RISC.
Re:Don't need to be an expert to beat compilers .. (Score:5, Insightful)
I once found a Bresenham line-drawing algorithm written in assembly (68k) on a Mac SE (black and white graphics).
That code was extremely complicated, and the main mistake was that the author loaded each byte, manipulated it, and wrote it back to memory on every iteration,
just to load, in many cases, the exact same byte again on the next iteration.
It immediately occurred to me not to load and store the bytes every iteration, and from there the jump to using words and long words, depending on how steep the line was, was easy.
So I first wrote C code that used chars, ints and longs depending on steepness and changed consecutive bits until it needed to access another "word". Each "word" was loaded into a register only once and was written back only after all the necessary bits were changed.
My C rewrite was so fast (more than 100 times faster than the original assembly) that I never rewrote it in assembly at all.
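The same idea in a minimal C sketch (a hypothetical helper, with LSB-first bit numbering assumed): touch each 32-bit word of the bitmap once instead of re-reading a byte per pixel.

    #include <stdint.h>

    /* Set bits [start, end) in a bitmap, loading/storing each word only once. */
    void set_bit_run(uint32_t *bitmap, int start, int end)
    {
        int w0 = start / 32, w1 = (end - 1) / 32;
        uint32_t first = ~0u << (start % 32);          /* bits start%32 .. 31 */
        uint32_t last  = ~0u >> (31 - (end - 1) % 32); /* bits 0 .. (end-1)%32 */

        if (w0 == w1) {                 /* run fits inside a single word */
            bitmap[w0] |= first & last;
            return;
        }
        bitmap[w0] |= first;            /* partial first word */
        for (int w = w0 + 1; w < w1; w++)
            bitmap[w] = ~0u;            /* full middle words, one store each */
        bitmap[w1] |= last;             /* partial last word */
    }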
Re: (Score:2)
Also, the algorithm is always the place to start. There is little point in optimizing the C code or going to assembly language until you are sure the algorithm is right. Fewer and wider memory accesses is a pretty good direction to explore as
Re: (Score:2)
I like to program in assembly, too. :D But not on x86 architectures. On the other hand, I haven't done it in 20 years, so no idea if it is less painful now.
I did assembly on ARM 3, 6502 - of course - 68k, SPARC, MIPS, DEC Alpha and PowerPC ... on the last one only very briefly, to get an idea, and I've forgotten everything about the last four ;D - I don't even remember whether I truly programmed on the Alpha or only analyzed compiler output.
68k (68030/040) I liked the most due to its orthogonal architecture, it was like
Re: Don't need to be an expert to beat compilers . (Score:4, Informative)
This is interesting. Do you have any examples?
Sorry, proprietary code of a past employer. It was in the domain of low level bitmapped graphics.
It's been quite a while, but one method that I recall involved branch prediction: if the condition to be used for a branch is known sufficiently far in advance, then a branch on the PowerPC has no penalty. An inner loop had numerous branches whose conditions could be determined quite early. I recall computing them early and storing things in the spare condition registers. The result was that the inner loop no longer had any branch penalties. The Apple engineers and I just couldn't get the various compilers to do anything comparable.
This was just one of several things that I did, but I don't really recall the others. Well, except that the rotate-and-mask instructions can be amazingly useful.
Re: Don't need to be an expert to beat compilers . (Score:4, Interesting)
A good example that I ran into not too long ago was trying to get GCC to autovectorize some heavy matrix multiplication operations without using vector intrinsics. No matter how hard I tried, no matter how explicitly I forced the memory alignment (on x86, double-quadword loads into XMM registers require 16-byte alignment), and no matter how carefully I ensured that all operations were 128 bits wide (SSE codepath) or 256 bits wide (AVX codepath), GCC just couldn't figure it out on its own. I pored through the compiler output and managed to clear up a few ambiguous data dependencies, but I just couldn't get it to autovectorize the main loop.
I ended up digging around in the compiled ASM and noted where GCC was failing to unroll and reorder enough for SLP to work properly. I rewrote a small chunk of it by hand and got the results I expected, but doing so for a large portion of the project would have been unreasonable and would also have bound the source to x86. Instead, I switched from GCC to ICC, and ICC picked up the optimization right away. For shiggles I tried Clang/LLVM as well, but had no luck with it either.
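For comparison, the kind of hints that usually coax GCC into vectorizing simpler loops (a generic sketch, not the poster's matrix code): restrict rules out aliasing, and __builtin_assume_aligned promises the 16-byte alignment SSE wants.

    /* a * x + y over aligned, non-overlapping arrays; with -O3 GCC will
       typically vectorize this loop on its own */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        float *yp = __builtin_assume_aligned(y, 16);
        const float *xp = __builtin_assume_aligned(x, 16);
        for (int i = 0; i < n; i++)
            yp[i] = a * xp[i] + yp[i];
    }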
Re: (Score:2)
Call it "The God Optimization".
Assembly more durable than you might think ... (Score:2)
sure thing, if you want to rewrite your code for every cpu architecture ...
Nope. I only need to write assembly for the architectures I care about; all other architectures can have the reference C implementation.
... (and preferably also every generation of said architecture)
Probably not. Performance tuning may only be necessary for the older architectures, in order to lower the required/recommended system specs so that the potential market is larger.
Also, assembly is often more durable, longer-lived, than you might imagine. I have some math code that I wrote in assembly targeting the Pentium Pro (686) back in the day. Every once and a
Re:C versus Assembly Language (Score:5, Insightful)
I think that saying "This piece of code is going to be called a lot, so I'll implement it in assembler" is inadvisable. The more reasoned approach is "after profiling, my program spends a lot of time in this routine, so I'll go over the assembler the compiler generated to make sure it optimized correctly". The upshot being, it is useful to be able to read and write assembler when optimizing, but it would be rare that you would produce new code in assembly from whole cloth.
Re: (Score:2)
On that note, I'm pretty sure the optimizing here was done in C. The art is in the careful analysis of precision.
first the algorithm, then system, asm (Score:2)
ASM for certain inner loops can be a good sixth step in the optimization process. It only matters after the algorithm is optimized, then optimized again, and then the system-level stuff is right (i.e., look at how the storage structures used by the program translate to actual physical operations).
Re: (Score:2)
At one time that was absolutely true: nothing could beat hand assembly. These days the compiler can do a better job MOST of the time, and asm is only used for specialized implementations of low-level functions, like locking, that C wasn't designed to handle.
Re: (Score:2)
To make an algorithm fast, you need to improve the algorithm (so that the number of operations the algorithm needs to perform is minimised) and improve the number of operations per time unit that the algorithm can perform. Assembler _may_ help with the second task. However, writing assembler code is much, much more time consuming, so in a fixed amount of development time, mu
Re:C versus Assembly Language (Score:5, Insightful)
Keep in mind: this is what the compiler tried to do; when you start down this path you are saying "that fancy compiler doesn't know what it's doing, I'll do it all myself".
Trying to outsmart a compiler defeats much of the purpose of using one.
-- Kernighan and Plauger, The Elements of Programming Style
Re: (Score:2)
when you've measured and proven the compiler is generating sub optimal code
That's the important part. Don't start mucking around with low-level assembly for things until you've proven that you've got a problem and that the fix you're proposing to work on is worthwhile. (Where a library gets very widely distributed, such as a basic math library, it may well become worthwhile very quickly. Most code doesn't get anything like that level of distribution.)
Re: (Score:3)
If you're doing it on your own time, then muck around with anything you want. If you win you're a hero, and if you fail you got some experience.
Re:C versus Assembly Language (Score:5, Informative)
File a bug report for the compiler with 'missed optimization opportunity' in the title and a reduced test case.
We like to see real-world examples of where we're generating bad code - if we don't see them, we can't fix them.
Re: (Score:2)
If you are unsure and have the time, then write it both ways and compare; no need to guess. If the assembly version wins, make it a compile option, keeping the C code for regression testing and to allow for compiler improvements over time.
Re: (Score:2)
The first is that you need to keep testing. The next version of the C compiler may generate better code than your assembly. Or the version after. Optimisers keep improving (hopefully - occasionally they get worse, but if you want to avoid that then contribute test cases to the performance regression suite), but your assembly won't.
The second is that performance for this kind of code can be highly microarchitecture dependent. One of the FreeBSD devs did some analys
Re: (Score:2)
High-precision math is an excellent place to use assembly language. Assembly languages generally have a way to express ideas like a 32x32->64-bit multiply (and a 64x64->128-bit multiply) and add-with-carry. High-level languages generally support neither directly. To tell the compiler that you want a 32x32->64-bit multiply, you generally have to take two 32-bit inputs, cast one of them to 64-bit, and hope that the compiler doesn't actually generate a 64x64 multiply.
That has nothing to do with "hope": Clang and GCC do this without any problems. Clang and GCC also support 128-bit integers on 64-bit processors and do the same there. Worst case, you build some inline-assembler functions and use them as building blocks. Then you can do everything in a high-level language and leave all the boring details to the compiler.
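A minimal sketch of both idioms using GCC/Clang extensions (unsigned __int128 and __builtin_add_overflow both exist in current versions of either compiler):

    #include <stdint.h>

    /* 64x64 -> 128-bit multiply: on x86-64 this compiles to one MUL. */
    static void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
    {
        unsigned __int128 p = (unsigned __int128)a * b;
        *lo = (uint64_t)p;
        *hi = (uint64_t)(p >> 64);
    }

    /* Add-with-carry across limbs via __builtin_add_overflow. */
    static uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t carry_in,
                                   uint64_t *sum)
    {
        uint64_t t;
        uint64_t c1 = __builtin_add_overflow(a, b, &t);
        uint64_t c2 = __builtin_add_overflow(t, carry_in, sum);
        return c1 | c2;     /* carry out */
    }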
why ? (Score:5, Funny)
Who needs glibc anymore? We have systemd now.
Re: why ? (Score:2)
Can someone explain why this is funny? Something about D being the next letter after C?
excellent (Score:4, Interesting)
Re: (Score:2)
Back in the mid-80s I was involved in the design of a "mini Cray" supercomputer. We did not yet have any hardware to run on, but we did have a software simulator, and we wanted to publish some "whetstone" numbers. We got some numbers, were not too happy with them, and really dug in to analyze what we could do to improve them. The Whetstone code was in C, and used a f
Re: (Score:2)
It may be the case, though, that it is possible to
Re: (Score:2)
As the algorithm is "tail recursive", it is safe to assume the compiler will apply tail-recursion optimization, which converts the recursive call into a loop.
Re: (Score:2)
Back when I was doing my PhD, the prevailing instruction to undergraduates was to favour iteration over recursion because it's faster. One of my colleagues rewrote some code as a single loop rather than recursive calls and found that on Linux it got slower, while on Windows it stayed the same. Looking at the assembly, it turned out that managing his own stack on the stack generated significantly worse code with GCC than just using the call stack. On Windows, it was even more interesting: the Microsoft compiler
Re: (Score:2)
Obnoxious GP happened to be right in this case, provided that the recursion is always applied to the smaller partition first, leaving the potentially much deeper recursion on the other partition for the tail, and provided that the compiler manages to detect the tail-recursion opportunity, which would need to be verified.
Re:excellent (Score:5, Informative)
No, you can do quicksort with one recursive call. Even in the two-recursive-call version, an optimizing compiler will turn it into one recursive call and a loop; it's such a basic optimization that it has a name: tail recursion.
Basically, if you have a recursive call at the end of a function, then instead of making a real call you can reuse the current stack frame by readjusting the input parameters (on the stack) and jumping to the beginning of the function, saving yourself the headache of setting up and tearing down a stack frame. (In other words, you're doing a loop.)
In fact, this can be generalized into tail-call optimization: if you call another function at the end of a function, the optimizing compiler will reuse the current function's stack frame for that call instead.
Syntactically the code looks the same, but the output is vastly different, because you get operations that rewrite the stack frame. (A particularly smart compiler might put the parameters of the call right where they need to be by moving the arguments around, so the final two instructions are a stack adjustment to compensate for different frame sizes and a direct jump; when the tail function returns, it doesn't stop back in the calling function but returns straight to the original caller.)
So yes, you're right that there are supposed to be two recursive calls in quicksort, but in practice there's only one, because the last one is always tail-recursive, so compilers merely reuse the existing stack frame.
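A minimal sketch of that pattern in C, recursing into the smaller partition and looping on the larger one, which also bounds the stack depth at O(log n):

    static int partition(int *a, int lo, int hi)   /* Lomuto partition */
    {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    void quicksort(int *a, int lo, int hi)
    {
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p - lo < hi - p) {          /* recurse on the smaller side */
                quicksort(a, lo, p - 1);
                lo = p + 1;                 /* the loop replaces the tail call */
            } else {
                quicksort(a, p + 1, hi);
                hi = p - 1;
            }
        }
    }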
Re: excellent (Score:2)
So how much better was it?
Re: (Score:2)
I was shocked to find how poor the performance of expf() was compared to exp() in glibc. It turns out that in a handful of functions they change the rounding mode of the FPU, which flushes the entire FPU state, obliterating performance. After switching to a different version -- from another library -- that didn't change rounding modes, performance was back on par.
It's perfectly understandable why rounding mode changes are necessary, since the FPU can be in any rounding mode coming in, and some guarantee
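For reference, this is roughly what a rounding-mode change looks like through the standard <fenv.h> interface (a sketch, not glibc's internal code); each fesetround() call is what makes those paths so expensive:

    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON

    /* Evaluate f(x) under round-toward-zero, then restore the old mode. */
    double eval_toward_zero(double (*f)(double), double x)
    {
        int old_mode = fegetround();
        fesetround(FE_TOWARDZERO);   /* expensive: disturbs FPU pipeline state */
        double r = f(x);
        fesetround(old_mode);
        return r;
    }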
Re: (Score:2)
I assume they're already fairly well optimized
You assume wrong. glibc is in reality pretty poorly optimized for most things. It works on God knows how many systems and is generally consistent, but optimization is not its strong point.
If speed is your concern and you have a choice between Linux with glibc and a commercial OS, you pick the commercial OS and its libraries almost every time. I can't actually think of a case where this doesn't apply.
It could be a lot worse, but it's certainly not as fast as it could be just about anywhere. I'm fairly cer
Re: (Score:3)
I don't know what the situation is like in GNU land, but in FreeBSD libc we basically have two guys who care about libm (the library that implements the math.h stuff). One is a retired mathematics professor; I'm not sure what the background of the other is. There are a few other people who occasionally contribute to it, but no one else really wants to change it, because floating-point code is really hard to understand. You need to find someone who:
How much benefit? (Score:4, Insightful)
It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say, so the performance benefits may be insignificant.
Re:How much benefit? (Score:5, Informative)
It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say, so the performance benefits may be insignificant.
From TFA, it sounds like the functions in question (pow() and exp()) work by first trying a lookup-table/polynomial-approximation technique to see if that can produce an accurate-enough value, then running through a series of multiple-precision multiplications to calculate the result if the table/approximation method isn't accurate enough. Since this work improved the actual calculation part, my guess is that it will help quite a few cases. TFA does say the lookup-table method works in the "majority of cases", though not exactly how big a majority, so it's hard to say.
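In outline, the dispatch TFA describes looks something like this (the helper names and the error-bound check are hypothetical, not glibc's actual internals):

    /* Sketch of the fast-path/slow-path split: keep the cheap table/polynomial
       result when its error bound is tight enough, otherwise fall through to
       the slow multiple-precision path that this work sped up. */
    double pow_sketch(double x, double y)
    {
        double err;
        double approx = table_poly_approx(x, y, &err);  /* hypothetical */
        if (err < ACCEPTABLE_ULP_BOUND)                 /* hypothetical bound */
            return approx;
        return mp_pow(x, y);                            /* hypothetical mp path */
    }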
Re: (Score:2)
Curiously, the Red Hat dev did not comment on the average-case performance improvement, only on the slow-path improvement. I initially missed that in a quick reading, as, I suspect, did many others.
Re: (Score:3)
Curiously, the Red Hat dev did not comment on the average-case performance improvement, only on the slow-path improvement. I initially missed that in a quick reading, as, I suspect, did many others.
It is difficult to compute the impact of this work on the average case, because we don't know precisely how many of the inputs in the entire domain (i.e. all unique FP representations in an IEEE 754 double) are serviced by the slow path. I wasn't able to get the libm code to go down the mp path for the log function after a week of throwing different inputs at it. pow, on the other hand, hit it fairly regularly within an hour. Even "fairly regularly" is about 1 in 1000 inputs, so that is not much. We know that it
Re: (Score:2)
Don't tell me that you've never written an algorithm which uses speculation. It's quite a common scenario: you have a common fast path and an uncommon slow path, and the cost of deciding which path to use is a significant fraction of the cost of the fast path plus checking the result.
In the case of libm, there are a lot of code paths which are there only to maintain strict compliance; no numeric analyst would ever call exp or pow (or even round) with arguments in t
Re: (Score:2)
It seems it's only the multiple-precision (MP) functions that are subject to this optimization, not the normal FP ones. My guess is that these aren't even used for soft floating point on FPU-limited systems (since knowing the exact size of the FP type allows writing more efficient routines).
So the impact is very low, I guess.
Re: (Score:3)
The slow paths are "several thousand times" slower, according to the article. You only need to hit them rarely to see a significant degradation of performance.
I was recently dealing with code that spent most of its time in pow(). Some basic algebra significantly simplified the math and let me do away with the call to the function entirely, but this shows its performance is a real-life concern for some people.
Re: (Score:2)
When I was in high school we had a FORTRAN class, and one of the assignments was to print out as many Pythagorean triples as you could in the allowed 1 minute of run time. Most students would start with the power and square-root functions, which would produce about a page and a half of results, of which one was wrong because of rounding errors. Going from A^2 to A*A would get you far more pages. The system had a multiply-accumulate function that worked very well, so a few changes in a formula could double the number
Re: (Score:3)
It looks like the slowest paths of the transcendental functions were improved by a lot. But how often do these paths get used? The article doesn't say, so the performance benefits may be insignificant.
They don't get used very often. I don't know exactly how rarely, though, because that would require running the functions over the entire domain of inputs, which would take years for the univariate functions and pretty much forever for pow. If you want anecdotal data, I have that: I didn't hit the slow path of log in an entire week of throwing different inputs at it. pow hit quite a few in an hour of running, but even that was about 1 in 1000 or less. However, since there is no pattern to determine which interval of i
Always room for improvement (Score:4, Interesting)
Newer hardware can make use of newer features, which changes what should be considered the best optimizations. Addition used to be much faster than multiplication until they put barrel multipliers in chips. Once floating-point cores were added, other things became faster, but the early FPUs could only do things like add and multiply quickly, and anything else could be very slow.

I wrote a floating-point library for OS-9 on the Radio Shack Color Computer, which had a 2 MHz 8-bit CPU with good 16-bit instructions and no floating-point hardware, and I could do trig and log functions faster than a 4.77 MHz 8087 floating-point unit. I could use tricks like bit shifting and de-normalising floating-point numbers for quick adds. There was one function where the typical Taylor series used a /3 + /5 + /7 type thing, but there was an alternative that used /2 + /4 + /8 and took more steps; since an integer CPU can divide by a power of 2 something like 50 times faster than by an odd number, doing the extra steps was faster. My library also took advantage of precision shortcuts, like simply not dealing with some of the low-order bits when the precision was going to be lost in the next step or two -- things that you simply can't do efficiently with current floating-point hardware.
Re: (Score:3, Insightful)
But having made an actual, real contribution to a piece of software in general use makes him more of an engineer than an unusually competent computer scientist.
Re: I'm surprised (Score:2, Informative)
Last time I checked, India was in Asia.
Re:Lookup tables are faster and more accurate (Score:5, Interesting)
It is perhaps a bit of historical irony that even for humans, a lookup table is faster and more precise than calculating via a formula by hand. That is why they published books of logarithms. Using interpolation you can even stretch the precision out by several more digits. With a table of values in memory you can also narrow down the inputs to Newton's method and calculate any differentiable function very quickly to arbitrary precision. With some functions the linear approximation is so close that you can converge in just a few cycles.
Even for most trigonometric functions there is a simple table upon which the angle-addition formulas are used to get the other values [an old example] [physic.ut.ee].
Given the size of most operating systems, where 8K of RAM is hardly noticed (most GIFs are larger than this), I am actually quite surprised that the lookup-table method is not used more. It would seem one of the first things to put in the cache next to your ALU.
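A minimal sketch of the table-plus-interpolation idea for sine on [0, pi/2] (a 257-entry table with linear interpolation between neighbours; the real thing would handle range reduction for other quadrants):

    #include <math.h>

    #define TAB_N 256
    static double sin_tab[TAB_N + 1];       /* sin(x) sampled on [0, pi/2] */

    void sin_tab_init(void)
    {
        for (int i = 0; i <= TAB_N; i++)
            sin_tab[i] = sin(i * (M_PI / 2) / TAB_N);
    }

    /* Look up sin(x) for x in [0, pi/2], interpolating between entries. */
    double sin_lookup(double x)
    {
        double t = x * (2 / M_PI) * TAB_N;
        int i = (int)t;
        if (i >= TAB_N)
            i = TAB_N - 1;                  /* clamp at the top edge */
        double frac = t - i;
        return sin_tab[i] + frac * (sin_tab[i + 1] - sin_tab[i]);
    }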
Re: (Score:3)
With a table of values in memory you can also narrow down the inputs to Newton's method and calculate any differentiable function very quickly to arbitrary precision. With some functions the linear approximation is so close that you can converge in just a few cycles.
No, you can't. I know this was done in Quake 3's fastInvSqrt(), but that is the exception, not the rule, in my experience. x = pow(a,b) is a differentiable function. How can you assemble a root-finding/Newton iteration to successively correct an initial guess for x to arbitrary precision -- without actually calling pow() or another transcendental function? I have built Newton (and Halley and Householder) iterations to successively correct estimates of pow(a,b) when b is a particular rational number. You can
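For the rational-exponent case the poster concedes, the iteration is standard; a minimal sketch for b = 1/2, where a table lookup would supply the initial seed:

    /* Refine a seeded guess for sqrt(x) with Newton's method; each step
       roughly doubles the number of correct digits. */
    double newton_sqrt(double x, double guess)
    {
        for (int i = 0; i < 5; i++)
            guess = 0.5 * (guess + x / guess);
        return guess;
    }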
Re: (Score:3)
For pow(a,b), [a,b real numbers], you are essentially calculating:
a^b = (e^log(a))^b = e^(b*log(a)), i.e. pow(a, b) = pow(e, b*log(a)), where e is the base of the natural logarithm.
What you have in your table are the values of e^x and log(x), like any good book of logarithms from ancient times. Precision according to your needs. For quick lookup you can even index the mantissa in a B-tree if your table is huge.
Then it becomes very quick:
step 1: look up log (a) in the table, interpol
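Stated as code, the parent's recipe is just this identity (a > 0 assumed; note that any error in log(a) is amplified by b, which is the precision problem the rest of the thread argues about):

    #include <math.h>

    /* a^b = e^(b * ln a): one log lookup, one multiply, one exp lookup. */
    double pow_via_exp_log(double a, double b)
    {
        return exp(b * log(a));
    }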
Re: (Score:2)
Here is a nice book [archive.org] that we had in our school days illustrating these sorts of techniques.
Re: (Score:2)
And a nice PDF paper [berkeley.edu] illustrating this technique and its merits.
Re: (Score:3)
But that was not my question. I fully understand how to use lookup tables/Chebyshev expansions of exp(x) and ln(x) to implement pow(x,a); I have implemented these many times. My question was specifically about your assertion that any differentiable function can be evaluated with a Newton-style iterative correction and thus provide arbitrarily precise results. I asked specifically to see how that is accomplished for pow(). There is no corrective mechanism in the algorithm you have stated above. The pr
Re: (Score:2)
Factorials tend to cancel out nicely with other factorials.
Also, 69! is too big for most calculators; even a pre-calculated table could probably do a lot to speed things along.
Re: (Score:2)
You mean the obligatory Walsh. Of course, it's obsolete now that RSQRTSS is ubiquitous.
Re: (Score:2)
Depends on what you're doing. Sending stuff to the GPU incurs overhead. If you're doing a ton of matrix operations all in a row, it may be worth it. If you're doing a few every once in a while, probably not.