Floating Point Arithmetic

Floating point arithmetic derives its name from something that happens when you use exponential notation. Consider the number 123: it can be written using exponential notation as:

  1. 1.23 * 10^2
  2. 12.3 * 10^1
  3. 123 * 10^0
  4. .123 * 10^3
  5. 1230 * 10^-1
etc. All of these representations of the number 123 are numerically equivalent. They differ only in their "normalization": where the decimal point appears in the first number. In each case, the number before the multiplication operator ("*") represents the significant figures in the number (which distinguish it from other numbers with the same normalization and exponent); we will call this number the "significand" (also called the "mantissa" in other texts, which call the exponent the "characteristic").

Notice how the decimal point "floats" within the number as the exponent is changed. This phenomenon gives floating point numbers their name. Only two of the representations of the number 123 above are in any kind of standard form. The first representation, 1.23 * 10^2, is in a form called "scientific notation", and is distinguished by the normalization of the significand:

in scientific notation, the significand is always a number greater than or equal to 1 and less than 10.
Standard computer normalization for floating point numbers follows the fourth form in the list above:
the significand is greater than or equal to .1, and is always less than 1.
Of course, in a
binary computer, all numbers are stored in base 2 instead of base 10; for this reason, the normalization of a binary floating point number simply requires that there be no leading zeroes after the binary point (just as the decimal point separates the 10^0 place from the 10^-1 place, the binary point separates the 2^0 place from the 2^-1 place). We will continue to use the decimal number system for our numerical examples, but the impact of the computer's use of the binary number system will be felt as we discuss the way those numbers are stored in the computer.
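As a sketch of the normalization idea (in decimal, as in the examples above), the following Python function shifts the decimal point until the significand lies between .1 and 1; the function name and structure are our own, purely for illustration:

```python
# Illustrative only: normalize a positive decimal number so that the
# significand s satisfies 0.1 <= s < 1, as in the fourth form above.
def normalize(x):
    exponent = 0
    while x >= 1:        # decimal point floats left; exponent grows
        x /= 10
        exponent += 1
    while x < 0.1:       # decimal point floats right; exponent shrinks
        x *= 10
        exponent -= 1
    return x, exponent   # x * 10**exponent reconstructs the original number
```

For 123 this yields a significand of (approximately) .123 and an exponent of 3, matching the fourth representation in the list.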

Floating Point Formats

Over the years, floating point formats in computers have not exactly been standardized. While the IEEE (Institute of Electrical and Electronics Engineers) has developed standards in this area, they have not been universally adopted. This is due in large part to the issue of "backwards compatibility": when a hardware manufacturer designs a new computer chip, they usually design it so that programs which ran on their old chips will continue to run in the same way on the new one. Since there was no standardization in floating point formats when the first floating point processing chips (often called "coprocessors" or "FPU"s: "Floating Point Units") were designed, there was no rush among computer designers to conform to the IEEE floating point standards (although the situation has improved with time).

For this reason, we will discuss both the IEEE standards as well as the floating point formats implemented in the very common Intel chips (such as the 80387, 80486 and the Pentium series). Each of these formats has a name like "single precision" or "double precision", and specifies the number of bits which are used to store both the exponent and the significand. We will define the notion of "precision" in the following way: if the significand is stored in n bits, it can represent a decimal number between 0 and 2^n - 1 (since a significand is stored as an unsigned integer). If we find the largest number "m" such that 10^m - 1 is less than or equal to 2^n - 1, m will be the precision. Consider the following:

  2^4 - 1  = 15             10^1 - 1 = 9
  2^8 - 1  = 255            10^2 - 1 = 99
  2^12 - 1 = 4,095          10^3 - 1 = 999
  2^16 - 1 = 65,535         10^4 - 1 = 9,999
  2^20 - 1 = 1,048,575      10^6 - 1 = 999,999
From the last example, it is easy to see that a 20 bit significand provides just over 6 decimal digits of precision. In the other examples, there is more precision than we have indicated. For example, a 16 bit significand is certainly sufficient to represent many decimal numbers with more than 4 digits; however, not all 5 digit decimal numbers can be represented in 16 bits, and so the precision of a 16 bit significand is said to be "> 4" (but less than 5). Some texts attempt to more accurately describe the precision using fractions, but we do not feel the need to do so.
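The search for m can be written directly in a few lines of Python, mirroring the definition above (a sketch, with a name of our own choosing):

```python
# Precision: the largest m such that 10**m - 1 <= 2**n_bits - 1,
# i.e. the number of decimal digits an n-bit significand can always hold.
def decimal_precision(n_bits):
    m = 0
    while 10 ** (m + 1) - 1 <= 2 ** n_bits - 1:
        m += 1
    return m
```

For example, `decimal_precision(20)` returns 6, matching the last row of the table, and `decimal_precision(52)` returns 15, matching the double format discussed below.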

The following table describes the IEEE standard formats as well as those used in common Intel processors:

                        Sign         Exponent     Significand   Total length   Decimal digits
  Precision             (# of bits)  (# of bits)  (# of bits)   (in bits)      of precision
  IEEE / Intel single   1            8            23            32             > 6
  IEEE single extended  1            >= 11        >= 32         >= 44          > 9
  IEEE / Intel double   1            11           52            64             > 15
  IEEE double extended  1            >= 15        >= 64         >= 80          > 19
  Intel internal        1            15           64            80             > 19
Note first that all of the formats reserve one bit to store the sign of the number; this is necessary because the significand is stored as an unsigned fraction in all of these formats (often the first bit of the significand is not even stored, because it is always 1 in a properly normalized floating point number). The rows describing the IEEE extended formats specify the minimum number of bits which the exponent and significand must have in order to satisfy the standard. The Intel "internal" format is an extended precision format used inside the CPU chip, which allows consecutive floating point operations to be performed with greater precision than that which will eventually be stored.

Exponents are commonly stored in these formats as unsigned integers; however, an exponent can be negative as well as positive, and so we must have some technique for representing negative exponents using unsigned integers. This technique is called "biasing": a positive number is added to the exponent before it is stored into the floating point number. The stored exponent is then called a "biased exponent". If the exponent contains 8 bits, the bias number 127 is added to the exponent before it is stored so that, for example, an exponent of 1 is stored as 128. Since the unsigned exponent can represent numbers between 0 and 255, it should be theoretically possible to store exponents whose values range from -127 to +128 (-127 would be stored as the biased exponent value 0, and +128 would be stored as the biased value 255). In practice, the IEEE specification reserves the values 0 and 255, which means that an 8 bit exponent can represent exponent values between -126 and +127. If the stored (biased) exponent has the value 0, and the significand is 0 as well, the value of the floating point number is exactly 0. A floating point number with a stored exponent of 0 and a nonzero significand is of course unnormalized. If the stored exponent has the value 255 (all ones), the floating point number has one of two special meanings: if the significand is 0, the number represents (positive or negative) infinity; if the significand is nonzero, the value is "Not a Number" (a "NaN"), which signals the result of an invalid operation.

In general, if n bits are used to store the exponent, the bias value is 2^(n-1) - 1, the range of exponents which can be represented is from
-2^(n-1) + 2 to +2^(n-1) - 1, and
a biased exponent of 2^n - 1 (all ones) indicates either infinity or a NaN (as above). The Intel double precision floating point format, which has an 11 bit exponent field, uses a bias of
2^10 - 1 = 1,023,
and can represent exponents between
-2^10 + 2 = -1,022 and +2^10 - 1 = +1,023.
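You can inspect the biased exponent directly. The Python sketch below unpacks a double (Python floats use the IEEE / Intel double format on essentially all modern platforms) into its sign, biased exponent, and stored significand bits. Note that IEEE normalization keeps the leading 1 bit of the significand implicit, so 1.0 is stored with a true exponent of 0, i.e. a biased exponent of 0 + 1,023 = 1,023:

```python
import struct

# Split an IEEE / Intel double into its three stored fields.
def double_fields(x):
    # Reinterpret the 8 bytes of the double as a 64 bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 sign bit
    biased_exp = (bits >> 52) & 0x7FF     # 11 exponent bits (bias 1,023)
    significand = bits & ((1 << 52) - 1)  # 52 stored significand bits
    return sign, biased_exp, significand

print(double_fields(1.0))    # -> (0, 1023, 0)
print(double_fields(-2.0))   # -> (1, 1024, 0): exponent 1 stored as 1,024
```

A biased exponent of 0 with a zero significand encodes the value 0, exactly as described above.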
Note that while all of our examples use the decimal number system (for your convenience), the computer uses binary as the base for the exponents as well (although in the past, some computers used 16 as the base for the exponents). So, for example, the number 1 has a normalized binary floating point value of
.1 (binary) * 2^1 (with a 1 in the 2^-1 place, this is equivalent to 1/2 * 2);
the number 3 has the normalized binary floating point value of
.11 (binary) * 2^2 (with a 1 in the 2^-1 place and a 1 in the 2^-2 place, this is equivalent to (1/2 + 1/4) * 4), etc.
In contrast, of course, you would represent these numbers in decimal as
.1 * 10^1 and
.3 * 10^1, respectively.
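The binary base has a practical consequence that is easy to see from Python (whose floats are binary doubles): a fraction like 1/10, exact in decimal, has no finite binary representation, so what is actually stored is only an approximation — a point we return to below under the name "representation error":

```python
from decimal import Decimal

# Decimal(0.1) prints the exact value of the binary double nearest to 1/10.
print(Decimal(0.1))        # not exactly 0.1!
print(0.1 + 0.2 == 0.3)    # -> False: the binary approximations do not add up exactly
```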


In order to illustrate some of the details of floating point arithmetic, we will consider an imaginary floating point format in which the exponent is stored in 5 bits, the significand is stored in 10 bits, and 1 bit is used to store the sign of the number. Using exponent biasing and reserving the values 0 and 31 (2^5 - 1), our bias value will be 15 and our exponent will therefore be able to represent the values -14 to 15. Since the significand is stored in 10 bits, and 2^10 - 1 = 1,023, we see that our imaginary format provides us with three decimal digits of precision (since all of the numbers from 0 to 999 fit in 10 bits, but not all those from 0 to 9,999 fit). We will do all of our examples using decimal, but always keep in mind that the computer always uses binary!

For our first example, consider the sum

122 + 12.
We first normalize these numbers as
.122 * 10^3 and .12 * 10^2.
But already there is a complication: we can't simply add two decimal numbers which are multiplied by different exponents! That is, the answers
.242 * 10^3 or .242 * 10^2
are obviously incorrect! To solve this problem, the number with the smaller exponent must be denormalized before the addition can take place:
.12 * 10^2 becomes .012 * 10^3.
Now it is clear that we can simply add the decimal numbers, since a * 10^x + b * 10^x = (a + b) * 10^x, and we get the answer
.134 * 10^3
which is of course 134. If the relative sizes of the two numbers are too different, we may have one of two errors. As an example of the first type of error, consider the sum 1220 + 14. In our hypothetical computer, these numbers are normalized as
.122 * 10^4 and .14 * 10^2.
But when we denormalize the smaller number before adding, the fact that we have only 3 decimal digits of precision causes a
truncation error:
.14 * 10^2 becomes .001 * 10^4;
that is, the second significant digit was lost: it was denormalized out of existence. In fact, if we consider the sum 1220 + 1.4, we see that the second operand (1.4) is denormalized to zero:
.14 * 10^1 becomes .000 * 10^4!
This is called an "underflow" error; it and the truncation errors are called "representation errors". You are already familiar with representation errors: some numbers have no finite representation in the decimal number system, such as 1/3 (which cannot be written down as a finite string of numbers to the right of the decimal point). For the same reasons, many numbers have no finite representation in binary. This includes all so-called "non-terminating" numbers in decimal, as well as any fraction with a power of 5 in the denominator (which may have a finite representation in decimal; e.g., 1/5 = .2). And of course there is representation error any time you need more precision than the computer provides: 1273 is represented on our hypothetical computer as
.127 * 10^4,
because, again, we only have 3 decimal digits of precision!
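The addition examples above can be sketched in Python. Below, a toy number is a pair (sig, exp) with a 3 digit integer significand, representing .sig * 10^exp; the pair layout and the function name are our own, chosen to mimic the hypothetical 3 digit machine (not any real hardware format):

```python
# Toy decimal float: (sig, exp) represents .sig * 10**exp, with sig an
# integer holding 3 significant digits (e.g. 122 = (122, 3), 12 = (120, 2)).
def add(a, b):
    (sa, ea), (sb, eb) = a, b
    if ea < eb:                 # make a the number with the larger exponent
        (sa, ea), (sb, eb) = (sb, eb), (sa, ea)
    sb //= 10 ** (ea - eb)      # denormalize the smaller number: low digits truncate!
    return sa + sb, ea

print(add((122, 3), (120, 2)))  # 122 + 12   -> (134, 3), i.e. 134
print(add((122, 4), (140, 2)))  # 1220 + 14  -> (123, 4): truncation error
print(add((122, 4), (140, 1)))  # 1220 + 1.4 -> (122, 4): the 1.4 underflows to zero
```

A real implementation would also renormalize when the sum carries past 3 digits; we omit that step here to keep the truncation and underflow behavior in focus.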

Representation errors can also occur during multiplication. Consider the product 125 * 21. This product is represented in normalized form as

.125 * 10^3 * .21 * 10^2.
Now the convenience of exponential notation for multiplication has not been lost on the computer architects (this is why they chose exponential representations in the first place!). Since
x * 10^a * y * 10^b = (x * y) * 10^(a + b),
the product is computed by us as
(.125 * .21) * 10^5 = .02625 * 10^5
but because of the finite precision inherent in the computer (which here has only 3 digits of precision), the result is truncated (before normalization!) to
.026 * 10^5
and is then normalized to
.26 * 10^4.
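The same toy 3 digit format handles the multiplication above; notice the truncation before renormalization, exactly as in the text (again, a sketch of the hypothetical machine, not real hardware):

```python
# Toy decimal float: (sig, exp) represents .sig * 10**exp, with sig an
# integer holding 3 significant digits.
def multiply(a, b):
    (sa, ea), (sb, eb) = a, b
    prod, exp = sa * sb // 1000, ea + eb  # .125 * .21 -> 26250, truncated to 26 (.026)
    while prod and prod < 100:            # renormalize: leading digit must be nonzero
        prod *= 10
        exp -= 1
    return prod, exp

print(multiply((125, 3), (210, 2)))  # 125 * 21 -> (260, 4), i.e. 2600 (exact answer: 2625)
```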
In general, in order to perform any floating point arithmetic operation, the computer must:
  1. first represent each operand as a normalized number within the limits of its precision (which may result in representation error due to truncation of less significant digits);
  2. denormalize the smaller of the numbers if an addition or subtraction is being performed (which may again result in representation error due to the denormalization);
  3. perform the operation (which again may result in representation error due to the finite precision of the floating point processor); and finally
  4. renormalize the result.
The step which actually performs the operation can result in another kind of error:
overflows can occur in floating point arithmetic as well as in fixed, but they are detected in the exponent rather than the significand. If we attempt to multiply 2 * 10^7 times 7 * 10^7, the normalized product is
.2 * 10^8 * .7 * 10^8 = .14 * 10^16;
but our imaginary computer can only represent numbers with exponents up to 15!

It is therefore useful to know the range of exponents which your computer can represent. As an example, the Intel double precision format supports exponents in the range -1,022 to +1,023. Any result with an exponent above this range causes an overflow error; a result with an exponent below it underflows.
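The same overflow is easy to provoke with real doubles. Python floats use the IEEE / Intel double format, and a result whose exponent exceeds the format's maximum becomes the special value inf rather than a number:

```python
import math

ok = 2e7 * 7e7            # 1.4e15: comfortably within the double range
too_big = 1e308 * 10.0    # the exponent overflows the double format
print(too_big)            # -> inf
print(math.isinf(too_big))  # -> True
```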

Since these are exponents of 2, the range of numbers which can be represented as floating point doubles in an Intel CPU runs (approximately) from 2^-1,074 (a denormalized value using only the last of the 52 stored significand bits) to 2^1,024 (including the "unstored" bit of the significand). In decimal, this range is approximately 4.9 * 10^-324 to 1.7 * 10^308.
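These limits can be read off directly in Python, whose floats are IEEE / Intel doubles:

```python
import sys

print(sys.float_info.max)   # largest finite double, about 1.7 * 10^308
print(sys.float_info.min)   # smallest *normalized* positive double, about 2.2 * 10^-308
print(5e-324)               # smallest denormalized positive double; half of it rounds to 0
```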
It is worth noting that floating point operations are much slower than their corresponding fixed point counterparts. For example, on a 1.4 GHz Pentium 4 CPU, two 32 bit fixed point numbers can be added in about seven tenths of a billionth of a second (.71 nanoseconds). A fixed point multiply of two 32 bit numbers may take 4 or 5 times longer. By comparison, floating point operations may take tens or even hundreds of times longer to perform.

The next section is devoted to the representation of characters in the binary computer.


©2012, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.

Please send comments or suggestions to the author.