- 1.23 * 10^{2}
- 12.3 * 10^{1}
- 123 * 10^{0}
- .123 * 10^{3}
- 1230 * 10^{-1}

etc. All of these representations of the number 123 are numerically equivalent. They differ only in the position of the decimal point and the corresponding choice of exponent.

Notice how the decimal point "floats" within the number as the exponent is changed.
This phenomenon gives floating point numbers their name. Only two of the representations of the number 123
above are in any kind of standard form. The first representation, 1.23 * 10^{2}, is in a form called
"scientific notation", and is distinguished by the normalization of the significand: in scientific notation,
the significand is always a number greater than or equal to 1 and less than 10.

Standard computer normalization for floating point numbers follows the fourth form in the list above: the
significand is greater than or equal to .1, and is always less than 1.

Of course, in a binary computer, all numbers are stored in base 2 instead of base 10; for this reason, the
normalization of a binary floating point number simply requires that there be no leading zeroes after the
binary point (just as the decimal point separates the whole part of a number from its fractional part in
base 10, the binary point does so in base 2).
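As a quick illustration, the decimal form of this normalization is easy to sketch in Python (the function name `normalize10` is ours, not from the text, and the input must be positive):

```python
import math

def normalize10(x):
    """Normalize x > 0 as s * 10**e with .1 <= s < 1:
    the "standard computer normalization" described above, in base 10."""
    e = math.floor(math.log10(x)) + 1
    return x / 10 ** e, e

print(normalize10(123.0))  # (0.123, 3)
print(normalize10(0.05))   # (0.5, -1)
```

The exponent is chosen so that the decimal point sits immediately to the left of the first significant digit.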

We will discuss both the IEEE standards as well as the floating point formats implemented in the very common
Intel chips (such as the 80387, 80486 and the Pentium series). Each of these formats has a name like
"single precision" or "double precision", and specifies the number of bits which are used to store both the
exponent and the significand. We will define the notion of "**precision**" in the following way: if the
significand is stored in n bits, it can represent a number between 0 and 2^{n} - 1 (since a
significand is stored as an unsigned integer). If we find the largest number "m" such that 10^{m} - 1
is less than or equal to 2^{n} - 1, then m is the precision. Consider the following:

| Significand bits (n) | 2^{n} - 1 | Decimal digits (m) | 10^{m} - 1 |
|---|---|---|---|
| 4 | 15 | 1 | 9 |
| 8 | 255 | 2 | 99 |
| 12 | 4,095 | 3 | 999 |
| 16 | 65,535 | 4 | 9,999 |
| 20 | 1,048,575 | 6 | 999,999 |

From the last example, it is easy to see that a 20 bit significand provides just over 6 decimal digits of precision. In the other examples, there is more precision than we have indicated. For example, a 16 bit significand is certainly sufficient to represent many decimal numbers with more than 4 digits; however, not all 5 digit decimal numbers can be represented in 16 bits, and so the precision of a 16 bit significand is said to be "> 4" (but less than 5). Some texts attempt to describe the precision more accurately using fractions, but we do not feel the need to do so.
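This definition of precision can be checked with a few lines of Python; `decimal_digits` is a hypothetical helper name used only for this sketch:

```python
def decimal_digits(n_bits):
    """The largest m with 10**m - 1 <= 2**n_bits - 1: the precision,
    in decimal digits, of an n_bits-bit significand (as defined above)."""
    m = 0
    while 10 ** (m + 1) - 1 <= 2 ** n_bits - 1:
        m += 1
    return m

for n in (4, 8, 12, 16, 20):
    print(n, decimal_digits(n))  # 1, 2, 3, 4 and 6 digits respectively
```

Running it on 23 and 52 bits reproduces the "> 6" and "> 15" entries in the format table below.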

The following table describes the IEEE standard formats as well as those used in common Intel processors:

| Precision | Sign (# of bits) | Exponent (# of bits) | Significand (# of bits) | Total length (in bits) | Decimal digits of precision |
|---|---|---|---|---|---|
| IEEE / Intel single | 1 | 8 | 23 | 32 | > 6 |
| IEEE single extended | 1 | >= 11 | >= 32 | >= 44 | > 9 |
| IEEE / Intel double | 1 | 11 | 52 | 64 | > 15 |
| IEEE double extended | 1 | >= 15 | >= 64 | >= 80 | > 19 |
| Intel internal | 1 | 15 | 64 | 80 | > 19 |

Note first that all of the formats reserve one bit to store the sign of the number; this is necessary because the significand is stored as an unsigned fraction in all of these formats (often the first bit of the significand is not even stored, because it is always 1 in a properly normalized floating point number). The rows describing the IEEE extended formats specify the minimum number of bits which the exponent and significand must have in order to satisfy the standard. The Intel "internal" format is an extended precision format used inside the CPU chip, which allows consecutive floating point operations to be performed with greater precision than that which will eventually be stored.
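The single and double total lengths in the table can be verified with Python's standard `struct` module, which packs floats in exactly these formats:

```python
import struct

# struct packs 'f' as an IEEE single and 'd' as an IEEE double;
# their storage sizes match the "total length" column above.
print(struct.calcsize('f') * 8)  # 32 bits
print(struct.calcsize('d') * 8)  # 64 bits
```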

Exponents are commonly stored in these formats as unsigned integers; however, an exponent can be negative as
well as positive, and so we must have some technique for representing negative exponents using unsigned integers.
This technique is called "biasing": a positive number is added to the exponent before it is stored into the
floating point number. The stored exponent is then called a "**biased exponent**". If the exponent contains
8 bits, the bias number 127 is added to the exponent before it is stored so that, for example, an exponent of
1 is stored as 128. Since the unsigned exponent can represent numbers between 0 and 255, it should be theoretically
possible to store exponents whose values range from -127 to +128 (-127 would be stored as the biased exponent value
0, and +128 would be stored as the biased value 255). In practice, the IEEE specification reserves
the values 0 and 255, which means that an 8 bit exponent can represent exponent values between -126 and +127.
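We can inspect the biasing directly with Python's standard `struct` module. The sketch below (the helper name `single_fields` is ours) reinterprets a number's IEEE single precision bits and splits out the sign, biased exponent and stored significand. Recall that the leading 1 bit of the significand is not stored, so the hardware treats 2.0 as 1.0 (binary) * 2^{1}: an exponent of 1, stored as the biased value 128.

```python
import struct

def single_fields(x):
    """Split the IEEE single precision bits of x:
    1 sign bit, 8 biased exponent bits, 23 stored significand bits."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF
    significand = bits & ((1 << 23) - 1)
    return sign, biased_exp, significand

print(single_fields(1.0))   # (0, 127, 0): exponent 0, stored with bias 127
print(single_fields(2.0))   # (0, 128, 0): exponent 1 is stored as 128
print(single_fields(-1.0))  # (1, 127, 0): only the sign bit differs from 1.0
```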
If the stored (biased) exponent has the value 0, and the significand is 0 as well, the value of the floating point
number is exactly 0. A floating point number with a stored exponent of 0 and a nonzero significand is of
course unnormalized. If the stored exponent has the value 255 (all ones), the floating point number has one of
two special meanings:

- if the significand is 0, the number represents infinity, and
- if the significand is not zero, the number represents a "NaN" ("Not a Number"): the result of an invalid operation such as 0/0.

In general, if n bits are used to store the exponent, the bias value is 2^{n - 1} - 1, and the biased
exponent can represent exponent values between -2^{n - 1} + 2 and +2^{n - 1} - 1. For an 11 bit exponent,
the bias value is 2^{10} - 1 = 1,023, and the exponent can range from -2^{10} + 2 = -1,022 to
+2^{10} - 1 = +1,023.

Note that while all of our examples use the decimal number system (for your convenience), the computer uses
binary as the base for the exponents as well (although in the past, some computers used 16 as the base for
the exponents). So, for example, the number 1 has a normalized binary floating point value of

.1_{2} * 2^{1}

(with a 1 in the 2^{-1} place, this is equivalent to 1/2 * 2); the number 3 has the normalized binary floating
point value of

.11_{2} * 2^{2}

(with a 1 in the 2^{-1} place and a 1 in the 2^{-2} place, this is equivalent to (1/2 + 1/4) * 4), etc.
In contrast, of course, you would represent these numbers in decimal as

.1 * 10^{1} and .3 * 10^{1},

respectively.
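Python's standard `math.frexp` exposes exactly this binary normalization: it returns a significand s with .5 <= s < 1 (that is, .1 (binary) <= s < 1) together with the matching power of two.

```python
import math

# frexp(x) returns (s, e) with x == s * 2**e and .5 <= s < 1 for x > 0:
# the "no leading zeroes after the binary point" normalization above.
print(math.frexp(1.0))  # (0.5, 1):  1 = .1  (binary) * 2^1
print(math.frexp(3.0))  # (0.75, 2): 3 = .11 (binary) * 2^2
```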

For our first example, consider the sum 122 + 12. We first normalize these numbers as

.122 * 10^{3} and .12 * 10^{2}.

But already there is a complication: we can't simply add two decimal numbers which are multiplied by different exponents! That is, the answers

.242 * 10^{3} or .242 * 10^{2}

are obviously incorrect! To solve this problem, the number with the smaller exponent must be denormalized:

.12 * 10^{2} becomes .012 * 10^{3}.

Now it is clear that we can simply add the significands, since the exponents are the same; the sum is

.134 * 10^{3},

which is of course 134. If the relative sizes of the two numbers are too different, we may have one of two errors. As an example of the first type of error, consider the sum 1220 + 14. In our hypothetical computer, these numbers are normalized as

.122 * 10^{4} and .14 * 10^{2}.

But when we denormalize the smaller number before adding, the fact that we have only 3 decimal digits of precision causes a truncation error:

.14 * 10^{2} becomes .001 * 10^{4};

that is, the second significant digit was lost: it was denormalized out of existence. In fact, if we consider the sum 1220 + 1.4, we see that the second operand (1.4) is denormalized to zero:

.14 * 10^{1} becomes .000 * 10^{4}!

This is called an "underflow" error. Truncation affects the result even when neither operand vanishes: any sum whose exact value lies between 1,270 and 1,279 must be stored as

.127 * 10^{4},

because, again, we only have 3 decimal digits of precision!
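The hypothetical 3-digit decimal computer above can be simulated directly. The following Python sketch (all names are ours) normalizes each positive operand to an integer significand of 3 digits, denormalizes the smaller one by discarding low digits, adds, and renormalizes:

```python
from math import floor, log10

DIGITS = 3  # our hypothetical computer keeps 3 significant decimal digits

def normalize(x):
    """Return (s, e) with x ~= (s / 10**DIGITS) * 10**e, where s is an
    integer holding the DIGITS leading digits of x (truncated)."""
    e = floor(log10(x)) + 1
    return int(x * 10 ** DIGITS / 10 ** e), e

def add(x, y):
    """Add two positive numbers as described above: normalize, denormalize
    the operand with the smaller exponent, add, renormalize."""
    sx, ex = normalize(x)
    sy, ey = normalize(y)
    if ex < ey:
        sx, ex, sy, ey = sy, ey, sx, ex
    sy //= 10 ** (ex - ey)        # denormalization truncates low digits
    s, e = sx + sy, ex
    if s >= 10 ** DIGITS:         # renormalize if the sum carried
        s, e = s // 10, e + 1
    return s * 10.0 ** e / 10 ** DIGITS

print(add(122, 12))    # 134.0
print(add(1220, 14))   # 1230.0: the 4 was truncated during denormalization
print(add(1220, 1.4))  # 1220.0: 1.4 was denormalized to zero
```

The last two calls reproduce the truncation and underflow errors from the text.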

Representation errors can also occur during multiplication. Consider the product 125 * 21. This product is represented in normalized form as

.125 * 10^{3} * .21 * 10^{2}.

Now the convenience of exponential notation for multiplication has not been lost on the computer architects (this is why they chose exponential representations in the first place!). Since

x * 10^{a} * y * 10^{b} = (x * y) * 10^{a + b},

the product is computed by us as

(.125 * .21) * 10^{5} = .02625 * 10^{5},

but because of the finite precision inherent in the computer (which here has only 3 digits of precision), the result is truncated to

.026 * 10^{5}

and is then normalized to

.26 * 10^{4}.

In general, in order to perform any floating point arithmetic operation, the computer must:

- first represent each operand as a normalized number within the limits of its precision (which may result in representation error due to truncation of less significant digits);
- denormalize the smaller of the numbers if an addition or subtraction is being performed (which may again result in representation error due to the denormalization);
- perform the operation (which again may result in representation error due to the finite precision of the floating point processor); and finally
- renormalize the result.

The step which actually performs the operation can result in another kind of error: overflows can occur in floating point arithmetic as well as in fixed, but they are detected in the exponent rather than the significand. If we attempt to multiply 2 * 10^{7} by 7 * 10^{7}, the normalized product is

.2 * 10^{8} * .7 * 10^{8} = .14 * 10^{16};

but our imaginary computer can only represent numbers with exponents up to 15!
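The multiplication steps can be simulated the same way. This Python sketch (names ours) multiplies the integer significands, adds the exponents, truncates to 3 digits, renormalizes, and reports an overflow when the exponent exceeds the assumed limit of 15:

```python
from math import floor, log10

DIGITS, MAX_EXP = 3, 15  # 3 significant digits; exponents only up to 15

def normalize(x):
    """(s, e) with x ~= (s / 10**DIGITS) * 10**e, s an integer of DIGITS digits."""
    e = floor(log10(x)) + 1
    return int(x * 10 ** DIGITS / 10 ** e), e

def multiply(x, y):
    """Multiply two positive numbers following the steps above."""
    sx, ex = normalize(x)
    sy, ey = normalize(y)
    s, e = sx * sy // 10 ** DIGITS, ex + ey   # truncate to DIGITS digits
    while s and s < 10 ** (DIGITS - 1):       # renormalize
        s, e = s * 10, e - 1
    if e > MAX_EXP:
        raise OverflowError(f"exponent {e} exceeds {MAX_EXP}")
    return s * 10.0 ** e / 10 ** DIGITS

print(multiply(125, 21))  # 2600.0, not 2625: .02625*10^5 -> .026*10^5 -> .26*10^4

try:
    multiply(2e7, 7e7)    # .2 * 10^8 * .7 * 10^8 = .14 * 10^16
except OverflowError as err:
    print(err)            # exponent 16 exceeds 15
```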

It is therefore useful to know the range of exponents which your computer can represent. As an example, the Intel double precision format supports exponents in the range -1,022 to +1,023. Any result with an exponent outside this range causes an overflow error.

Since these are exponents of 2, the range of numbers which can be represented as floating point doubles in an Intel CPU is (approximately) 2^{-1,074} (with 52 unnormalized bits of significand) to 2^{1,024} (including the "unstored" bit of the significand). In decimal, this range is approximately 4.9 * 10^{-324} to 1.7 * 10^{308}.

It is worth noting that floating point operations are much slower than their corresponding fixed point counterparts. For example, on a 1.4 GHz Pentium 4 CPU, two 32 bit fixed point numbers can be added in about .71 billionths of a second (.71 nanoseconds). A fixed point multiply of two 32 bit numbers may take 4 or 5 times longer. By comparison, floating point operations may take tens or even hundreds of times longer to perform.
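These limits can be read off at run time in Python, whose float type is an IEEE / Intel double on essentially all modern platforms (an assumption worth stating, though it is safe in practice):

```python
import sys

print(sys.float_info.max)      # 1.7976931348623157e+308, about 1.7 * 10^308
print(sys.float_info.min)      # 2.2250738585072014e-308, smallest normalized double
print(5e-324)                  # smallest denormalized double, 2^-1074
print(sys.float_info.max * 2)  # inf: the exponent has overflowed
```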

The next section is devoted to the representation of characters in the binary computer.


©2012, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.

Please send comments or suggestions to the author.