C++ Floating Point Big Number Arithmetic

This is a fixed-precision floating point library. How much precision it has is a compile-time constant. Its purpose is to be more accurate and less of an appeal to authority than doubles and math.h. It does not load any values (that would be cheating, not to mention a problem when the precision changes), but it does precompute and cache a number of values, such as e and pi.

My motive for implementing big number arithmetic was to get the correct constants for symmetric multistep methods for n-body gravitational simulation. Doubles only took me so far. It turns out that implementing a big number library is a waste of time, because gcc comes with a floating point big number library, mpfr, which does all this stuff well already. But on the other hand, that requires setting up some libraries, and it isn't clear how to do that. g++ installed them on my computer, but i686-pc-cygwin-g++ doesn't compile even hello-world programs, and an hour of googling and experimenting did not bring joy. If I supply plain .h and .cpp files that do the job with no external dependencies (like assembly or math.h or mpfr.h), that avoids configuration issues. And anyhow, bignum arithmetic isn't all that big. If I've done it myself I'm depending on fewer black boxes. And maybe I'm learning something useful in the process.

How to use this library: download base.h, bigfloat.h, and bigfloat.cpp into your source tree somewhere. An example of using them: multistep.cpp #includes "bigfloat.h", and is compiled by "g++ -O3 bigfloat.cpp multistep.cpp -o multistep". You should probably replace base.h with your existing primitives and modify bigfloat.h and bigfloat.cpp to use them. base.h, bigfloat.h, bigfloat.cpp (and the example multistep.cpp) are public domain, no strings attached, no warranty implied. You can test it with "g++ -g -DBIGFLOAT_TEST bigfloat.cpp -o bigfloat".
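
For instance, a program of your own might look like the sketch below (the file name is hypothetical, and I'm assuming BigFloat has a small-integer constructor and overloaded arithmetic operators; check bigfloat.h for the actual interface):

    // hypothetical myprog.cpp -- compile with: g++ -O3 bigfloat.cpp myprog.cpp -o myprog
    #include "bigfloat.h"

    int main() {
      BigFloat one(1), three(3);     // assuming a small-integer constructor
      BigFloat third = one / three;  // assuming operator/ is overloaded
      // ... use 'third' to the full compile-time precision ...
      return 0;
    }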

Datatype

The datatype is a fixed array of 32-bit digits, a 64-bit exponent, a 1-byte length, and a boolean sign bit. It's not compact. The exponent is base 2^32, that is, one exponent step per digit. The digits are not initialized beyond the length, which makes certain operations faster (for example 1*1 requires 1 multiplication instead of n*n). Some operations use 16-bit digits internally to do calculations, but they all pack the answer back into 32-bit digits when they are done. The top digit is always nonzero, except for 0, which has no digits at all, and is positive and has a smaller exponent than anything else.
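
In code, the representation is roughly the following (a sketch; the field names and the digit count are mine, see bigfloat.h for the real declaration):

    #include <cstdint>

    static const int NUM_DIGITS = 16;  // the compile-time precision (illustrative)

    struct BigFloatSketch {
      uint32_t digit[NUM_DIGITS];  // only 'length' of these are meaningful, and
                                   // the top used digit is always nonzero
      int64_t  exponent;           // base 2^32: one exponent step per 32-bit digit
      uint8_t  length;             // digits in use; 0 represents the value 0
      bool     isNegative;         // sign; 0 is always positive
    };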

It would be better if the exponent were base 2 (like in mpfr) rather than base 2^32. It would be better if the numbers near 0 followed the IEEE standard, with the top digits 0, so the smallest representable numbers don't have a big gap right around 0. I reserve the right to change my code incompatibly in the future. Make a copy of it if you want something that doesn't change.

Testing

There are tons of special cases. The way to catch all those special cases is to exhaustively test all inputs. The way to exhaustively test all inputs to a bignum package is to reduce the data type: when testing, use just 4 2-bit digits with a range of exponents just a little more than double the number of digits, rather than n 32-bit digits with all 64-bit signed values as exponents. This preserves all the edge cases, including having some middle values, yet makes it possible to enumerate all possible values. If you have lots of time to burn you can test all possible values with 3 4-bit digits too (I assumed the digits are a power-of-two number of bits).
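
One way to arrange that is to make the digit type and the limits compile-time switches, something like the sketch below (the names here are mine; the real test build is driven by -DBIGFLOAT_TEST, and whether it scales the datatype down exactly this way is up to bigfloat.h):

    #include <cstdint>

    #ifdef BIGFLOAT_TEST
    typedef uint8_t Digit;                // each digit holds only DIGIT_BITS bits
    static const int     DIGIT_BITS = 2;  // a power of two, as the code assumes
    static const int     MAX_DIGITS = 4;  // 4 digits per number
    static const int64_t EXP_MIN = -9;    // a little more than double the
    static const int64_t EXP_MAX =  9;    //   number of digits
    #else
    typedef uint32_t Digit;
    static const int     DIGIT_BITS = 32;
    static const int     MAX_DIGITS = 16;      // the compile-time precision
    static const int64_t EXP_MIN = INT64_MIN;  // all 64-bit signed exponents
    static const int64_t EXP_MAX = INT64_MAX;
    #endif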

Addition and Subtraction

Addition and subtraction of floating point numbers is one of the hardest little problems in computer science. It has unbelievably many conditions and special cases. The examples below assume base-10 digits.
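
For instance (my own illustration, with 4 base-10 digits): to add 9.999e3 + 5.000e0, you first shift the smaller operand right three places to line up the exponents, and the carry out of the top digit then bumps the result's exponent, giving 1.000e4. Subtracting 1.000e0 - 9.999e-1 cancels all but the last digit, and the result has to be renormalized by shifting left and lowering the exponent to get 1.000e-4. On top of that there are signs to reconcile, zero operands, operands too far apart in magnitude to interact at all, and results that carry out of the top digit.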

Multiplication

I still used 32-bit digits for this, but I produced a 64-bit result and added the bottom 32 bits to one digit of the result and the top 32 bits to the next digit. I wasn't trying to be terribly efficient; I calculate the full 2n-digit intermediate result.
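
The inner loop amounts to something like this sketch (the names and the digit ordering are mine; here digit index 0 is the least significant):

    #include <cstdint>

    // Schoolbook multiplication of 32-bit digits: each product is 64 bits, and
    // its bottom and top halves are added into adjacent slots of a double-length
    // accumulator.  Carries still have to be propagated and the result
    // renormalized afterward.
    static void mulDigits(const uint32_t *a, int alen,
                          const uint32_t *b, int blen,
                          uint64_t *acc)  // alen+blen slots, already zeroed
    {
      for (int i = 0; i < alen; ++i) {
        for (int j = 0; j < blen; ++j) {
          uint64_t product = (uint64_t)a[i] * b[j];
          acc[i + j]     += product & 0xffffffffULL;  // bottom 32 bits
          acc[i + j + 1] += product >> 32;            // top 32 bits
        }
      }
    }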

If timing becomes an issue, I'll implement Karatsuba multiplication. This is much faster than the O(n^2) approach I currently use. Preparing for Karatsuba multiplication is what finally convinced me to have variable length floating point rather than fixed length floating point, because that algorithm is only fast if the lengths are variable. Karatsuba works like this: you know (10a + b) * (10c + d) = 100ac + 10(ad + bc) + bd. Well, (a + b) * (c + d) = (ac + ad + bc + bd). You have to calculate ac and bd for the full result anyhow. So, you could do 3 multiplications instead of 4 by saying the result is 100ac + 10((a+b)(c+d) - ac - bd) + bd. Notice that ad+bc might overflow a 64-bit number. I haven't crossed that bridge yet.
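
Here is the same identity in base 2^32 instead of base 10: two 64-bit numbers multiplied with three 32x32 multiplications instead of four. This is a toy illustration, not the library's code, and it leans on gcc's unsigned __int128 to assemble and check the result:

    #include <cstdint>
    #include <cassert>

    static unsigned __int128 karatsuba64(uint64_t x, uint64_t y) {
      uint64_t a = x >> 32, b = x & 0xffffffffULL;  // x = (2^32)a + b
      uint64_t c = y >> 32, d = y & 0xffffffffULL;  // y = (2^32)c + d
      uint64_t ac = a * c;
      uint64_t bd = b * d;
      // (a+b) and (c+d) can each be 33 bits, so their product can need 66 bits;
      // this is the overflow mentioned above.
      unsigned __int128 mid = (unsigned __int128)(a + b) * (c + d) - ac - bd;  // ad + bc
      return ((unsigned __int128)ac << 64) + (mid << 32) + bd;
    }

    int main() {
      uint64_t x = 0x123456789abcdef0ULL, y = 0xfedcba9876543210ULL;
      assert(karatsuba64(x, y) == (unsigned __int128)x * y);
      return 0;
    }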

Division and Invert

Division: I split each 32-bit digit into two 16-bit digits. If I used 32-bit digits, then to estimate the next digit I would need to divide a numerator with at least 2 digits of precision by a denominator with at least 1 digit of precision. That means at least a 64-bit numerator. That's challenging: if the denominator is 1, that isn't 32 bits of precision, that's 1 bit of precision. Shifting in 31 extra bits of denominator means shifting 31 extra bits into the numerator, which goes over 64 bits and doesn't fit. Also I allow signed differences, so I really only have 63-bit integers to play with, not 64. With 16-bit half-digits, the denominator can have at least 1 digit of precision (1 to 2 half-digits) and the numerator at least 2 digits of precision (2 to 3 half-digits if the digits were all in range, maybe up to 4 if there were some unhandled carries). 2*16=32 bits and 3*16=48 bits fit comfortably in a 63-bit integer.

Estimating the next digit could be worked around by using doubles instead, with the numerator 2^32 times bigger than the denominator, then it only needs 32 bits of precision. But multiplying the estimated digit by the digits of the remainder will overflow 63 bits again, especially if there are some unhandled carries in the remainder. Maybe 32-bit digits could be made to work, but it'd be a tricky dance, so I kept to 16-bit digits.
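
With 16-bit half-digits the estimate of the next quotient digit comes down to one 63-bit division, roughly like this (a sketch with made-up names, not the library's code; the estimate can be a little off and gets corrected as the remainder is updated):

    #include <cstdint>

    // Estimate the next quotient half-digit from the leading half-digits of the
    // running remainder and of the divisor.  rem[] may hold signed,
    // not-yet-normalized values; div[0] is the divisor's top nonzero half-digit.
    static int64_t estimateDigit(const int32_t rem[3], const uint16_t div[2]) {
      // numerator: three half-digits, roughly 48 bits plus slack for carries
      int64_t num = ((int64_t)rem[0] * 65536 + rem[1]) * 65536 + rem[2];
      // denominator: two half-digits, 17 to 32 bits of precision
      int64_t den = (int64_t)div[0] * 65536 + div[1];
      return num / den;  // fits comfortably in 63-bit signed arithmetic
    }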

Division can be carried on for infinitely many digits. For the scaled-down datatype and all possible numerators and divisors, an extra 2 half-digits was not good enough to have (numerator - divisor*quotient)/numerator with an exponent of at most 1-maxPrecision. I got that by trying all possible inputs for a variety of scaled-down datatypes, not by a proof.

Square Root

You are given x^2 and need to find x. Again, split each 32-bit digit into two 16-bit digits. Collect the first 4 half-digits in an integer and use Newton's method to figure out the first 2 estimated digits of the square root.
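
That initial estimate can be done entirely in 64-bit integers, along these lines (a sketch; the real code's details may differ):

    #include <cstdint>

    // Integer Newton's method on the leading 64 bits of x^2 (four 16-bit
    // half-digits, top half-digit nonzero).  Returns roughly the top 32 bits,
    // that is, the top two half-digits, of the square root.
    static uint64_t sqrtEstimate(uint64_t top4) {
      uint64_t r = 1ULL << 32;                // start above the true root
      for (;;) {
        uint64_t next = (r + top4 / r) >> 1;  // Newton step for r*r = top4
        if (next >= r) break;                 // stopped decreasing: converged
        r = next;
      }
      return r;
    }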

After that, you want to reduce the error of your guess. You have the square x^2 and a guess at the answer, (x+d). Now, x^2 - (x+d)^2 = -2xd - d^2. So (remainder)/(-2*old_guess) = (-2xd - d^2)/(-2*(x+d)) = d - d^2/(2(x+d)) is an estimate of d that will bring us closer to the right answer, provided |d^2/(2(x+d))| is smaller than |d|. For example, if your current estimate of sqrt(2) is 1, then sqrt(2)^2 - (sqrt(2)-.4142...)^2 = 2 - 1 = 1, and 1/(-2*1) = -.5 is an estimate of d. It's a reasonable estimate: d^2/(2(x+d)) = (-.5)^2/(2*1) = .125, and |.125| < |-.5|.

Suppose you estimate your error as d but your guess is really x+d+e, hopefully with |e| much smaller than |d|. Then (x+d+e)^2 = x^2 + d^2 + e^2 + 2xd + 2xe + 2de = (x+e)^2 + 2xd + 2de + d^2 = (x+e)^2 + 2d(x+d+e) - d^2. So, to replace x^2 - (x+d+e)^2 with x^2 - (x+e)^2, you need to add 2d(x+d+e) and subtract d^2. When your old estimate is low, d will be negative, so 2d(old_estimate) and -d^2 have the same sign.

For example, if I estimate sqrt(2) = 1, and the next guessed delta is -.5, then the old remainder is x^2 - (x+d+e)^2 = 2 - 1 = 1 = sqrt(2)^2 - (sqrt(2) + -.5 + .08578...)^2. To get to a new remainder of -.25 = 2 - 2.25 = sqrt(2)^2 - (1.5)^2, starting from an old remainder of 1 = 2 - 1^2, I need to add 2*(-.5)*(1) and subtract (-.5)^2 from the old remainder; that is, the new remainder is 1 - 1 - .25 = -.25.

We could apply this with bignum Newton's method again, doubling the number of correct digits each round. Or we could increase the estimate just one digit at a time, using 8-byte integer arithmetic, which is what bignum multiplication and division do internally. Which is faster? Doing it one digit at a time with integer arithmetic: it's cheaper than a single bignum invert, so it's clearly cheaper than log(n) bignum inverts.

Sin, Cos, Tan, Exp, Ln, π, e

There's a Brent-Salamin method that doubles the number of digits of accuracy for π each iteration. It requires a square root function; that is what clinched that I had to implement square root. Several others (e, exp, sin, cos, ln(1+x), atan) have reasonable Taylor series near zero. Others (tan, sec, arccos, power) are simple combinations of other functions.

A vital trick is to use the Taylor series only very close to zero, because it converges quickly there, and to splice those results together in various ways to form the full functions. Beware that making the segments smaller causes slight discontinuities at the additional splice points.

Another useful trick is to cache constants. For example, 1/n! is the same value regardless of x, so you could make a static array of the n values 1/0! .. 1/(n-1)!, where n is how many terms were needed to compute e. A common Init() method can precalculate all the constants needed for all of these routines.
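
A sketch of that caching, assuming BigFloat has a small-integer constructor and overloaded arithmetic (the names and the real interface may differ; see bigfloat.h):

    #include "bigfloat.h"

    static const int EXP_TERMS = 64;          // however many terms e needs
    static BigFloat invFactorial[EXP_TERMS];  // invFactorial[n] caches 1/n!

    void InitConstants() {                    // call once, e.g. from Init()
      BigFloat factorial(1);                  // running value of n!
      invFactorial[0] = BigFloat(1) / factorial;  // 1/0! = 1
      for (int n = 1; n < EXP_TERMS; ++n) {
        factorial = factorial * BigFloat(n);  // n! = (n-1)! * n
        invFactorial[n] = BigFloat(1) / factorial;
      }
    }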

Another useful trick is to use integer arithmetic instead of bignum arithmetic whenever possible. For example, if you have to increase n! to (n+2)!, don't multiply a BigFloat by n+1 and then multiply it again by n+2; instead calculate the integer (n+1)*(n+2) and multiply by that. Another example: if you want to find x^n/n! - x^(n+2)/(n+2)!, do x^n * ((n+1)*(n+2) - x^2) / (n+2)!. Converting small integers to BigFloat is faster than looking up cached versions of the same thing.
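
In code the difference looks something like this (hypothetical names, and again assuming overloaded operators and a small-integer constructor):

    #include <cstdint>
    #include "bigfloat.h"

    // Slow: two bignum multiplications to step n! up to (n+2)!.
    void stepFactorialSlow(BigFloat &factorial, int64_t n) {
      factorial = factorial * BigFloat(n + 1);
      factorial = factorial * BigFloat(n + 2);
    }

    // Fast: one integer multiplication, then one bignum multiplication.
    void stepFactorialFast(BigFloat &factorial, int64_t n) {
      factorial = factorial * BigFloat((n + 1) * (n + 2));
    }

    // The paired Taylor terms from the text:
    //   x^n/n! - x^(n+2)/(n+2)!  =  x^n * ((n+1)*(n+2) - x^2) / (n+2)!
    BigFloat pairedTerms(const BigFloat &xToN, const BigFloat &xSquared,
                         int64_t n, const BigFloat &invFactNPlus2) {  // cached 1/(n+2)!
      return xToN * (BigFloat((n + 1) * (n + 2)) - xSquared) * invFactNPlus2;
    }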

For cos(-π/4..π/4) and sin(-π/4..π/4), |x| stays less than .8, and the coefficients get small quickly, so the Taylor series converges reasonably fast. cos(x=π/4..π/2) is just sin(π/2-x) and vice versa, and all other ranges are also copies of those curves; you just need an accurate modulo π/4. There are some identities that seem like they could get the same results but with x closer to zero: cos(4x) = 1 - 8sin^2(x) + 8sin^4(x), and sin(3x) = 3sin(x) - 4sin^3(x). I don't know if it's better to make use of those identities or compute values directly.