As you might have guessed, most of my posts are a synthesis of two or more ideas. This time, they are IEEE-754 floating point formats and composite arithmetic formats. To provide some overview, IEEE-754 has been the official standard for representing numbers in computers for over 25 years, and is widely deployed in practically every computer architecture. Composite arithmetic on the other hand is a new approach which uses a combination of integer, rational, scientific, and tetrational representation in unison to achieve greater accuracy, precision, and dynamic range.
References for composite arithmetic include: "Beyond Floating Point" (by C.W.Clenshaw and F.W.J.Olver) which focuses on extending floating point with tetration alone, "Composite Arithmetic, a Proposal for a New Standard" and "Composite Arithmetic, A Storage Form" (both by W. Neville Holmes) which both focus on combining all four approaches, "Design of a Composite Arithmetic Unit for Rational Numbers" (by Tomasz Pinkiewicz, W. Neville Holmes, and Tariq Jamil) which focuses on the implementation of these in hardware, "Design of a 32-bit Arithmetic Unit based on Composite Arithmetic and its Implementation on a Field Programmable Gate Array." (by Tomasz Hubert Pinkiewicz) which focuses on the same, and "Lecture notes on: Computer Arithmetic: Principles, Architectures, and VLSI Design" (by Reto Zimmermann) which seems to be an overview of the subject targeted at students.
The most recent edition of IEEE-754 introduced a 16-bit floating point format, called 'binary16' for short. Since this is the smallest standard floating point format, we will use it as an example of how IEEE-754 represents numbers. To begin we will see how it represents 1.5:
s | exponent | significand | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bin | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
hex | 3 | E | 0 | 0 | ||||||||||||
value | = 1.5 |
and this is how the standard binary16 format represents infinity:
s | exponent | significand | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bin | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
hex | 7 | C | 0 | 0 | ||||||||||||
value | = Infinity |
As you can see, binary16 uses a 5-bit exponent and a 10-bit significand. One important class of representations which do not represent numbers is known collectively as Not-A-Number (NaN) representations. These are when the exponent is all 1's, regardless of the sign. Infinity is a particular example of this, since the exponent is all 1's and the significand is all 0's. If the significand is nonzero, then the remaining representations fall into two categories: signaling NaNs (sNaN) and quiet NaNs (qNaN). A signaling NaN is traditionally represented with a 0 in the MSB of the significand, and some other bit nonzero. A quiet NaN is traditionally represented with a 1 in the MSB of the significand, with anything in the other bits. Since sNaNs can cause errors and trap handlers to invoke, it is a bad idea to use them to represent numbers, since many architectures will automatically convert sNaNs to qNaNs. So, we will leave sNaNs alone, and replace qNaNs with a tetrational number representation format.
This tetrational format will use the two MSBs to represent quietness (q) and height (h) respectively. The tetrand will become the exponent at the top of the exponential tower. So a height of 0 will indicate a value of 2^2^2^2^2^(0.tetrand), and a height of 1 will indicate a value of 2^2^2^2^2^2^(0.tetrand). In general, the value associated with the tetrational format is:
We will start at 65536, which is just greater than the largest value represented by binary16 (65504). Extending this to larger formats (like binary32) should probably start here as well, because the next height is too much larger than the largest number represented by binary32. Starting from 65536, we can represent it as follows:
binary16: | s | exponent | significand | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tetra16: | s | exponent | q | h | tetrand | |||||||||||||
bin | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
hex | 7 | E | 0 | 0 | ||||||||||||||
value | ≈ 65536 |
We get this value because 2^2^2^2^2^(0.0) = 2^2^2^2^1 = 2^2^2^2 = 2^2^4 = 2^16 = 65536. The next greater representation is
binary16: | s | exponent | significand | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tetra16: | s | exponent | q | h | tetrand | |||||||||||||
bin | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ||
hex | 7 | E | 0 | 1 | ||||||||||||||
value | ≈ 71036 |
and perhaps another interesting representation is that of googolplex:
binary16: | s | exponent | significand | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tetra16: | s | exponent | q | h | tetrand | |||||||||||||
bin | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | ||
hex | 7 | F | B | 2 | ||||||||||||||
value | ≈ googolplex = 10^(10^100) |
Note that we only used a single bit for the height of the exponential tower, but we could have used more. I think that 2 bits should be used for height in binary32, and 3 bits should be used for height in binary64 and binary128. All floating point formats would encode the height and tetrand in the extra bits of a qNaN so as to avoid conflicts with hardware arithmetic.
Here is an overview of the extensions to the binary16 format:
#x0000 = +0
#x0001 = 5.960464477539063e-8
... (standard binary16 floating-point) ...
#x7BFF = 65504.
#x7C00 = +Infinity
#x7C01 = +sNaN
... (standard binary16 signaling NaN) ...
#x7DFF = +sNaN
#x7E00 = 65536 = 2^2^2^2^2^(0)
#x7E01 = 71036.
... (new tetrational-point) ...
#x7E?? = 2^2^2^2^2^(0.??_16)
...
#x7EFE = 8.9423389054e+15721
#x7EFF = 6.2622606021e+17594
#x7F00 = 2^65536 = 2^2^2^2^2^2^(0) = 2^2^2^2^2
#x7F01 = 6.8504792165e+21383
... (new tetrational-point) ...
#x7F?? = 2^2^2^2^2^2^(0.??_16)
...
#x7FFE = 10^(2.69191e+15721) which is < 2^(2^65536)
#x7FFF = +qNaN (unique)
#x8000 = -0
#x8001 = -5.960464477539063e-8
... (standard binary16 floating-point) ...
#xFBFF = -65504.
#xFC00 = -Infinity
#xFC01 = -sNaN
... (standard binary16 signaling NaN) ...
#xFDFF = -sNaN
#xFE00 = -65536
#xFE01 = -71036.
... (new tetrational-point) ...
#xFE?? = -2^2^2^2^2^(0.??_16)
...
#xFEFE = -8.9423389054e+15721
#xFEFF = -6.2622606021e+17594
#xFF00 = -2^65536
#xFF01 = -6.8504792165e+21383
... (new tetrational-point) ...
#xFF?? = -2^2^2^2^2^2^(0.??_16)
...
#xFFFE = -10^(2.69191e+15721) which is > -2^(2^65536)
#xFFFF = -qNaN (unique)
In conclusion, we have shown exactly how to increase the dynamic range (quite dramatically) of standard IEEE-754 floating point formats by using a tetrational representation encoded in the unused bits of quiet NaNs. This provides a backwards-compatible hardware-accelerated tetrational floating point format which will encode approximate overflows as precisely as possible. So instead of overflows, you get a number! This makes it possible to be more informative when it comes to mathematical functions such as exp, log, or 1/x. With a bigger dynamic range, many of the outputs of these functions will no longer be indeterminate, but instead, finite.