So far in this chapter we have been concerned with computing "snapshot" quantities: numbers that describe a single device, process or file. But it is often vitally important to give meaning to groups of numbers.
In the last section, we illustrated how multimedia applications require large capacity disk drives for data storage. Anyone working with digital photographs, music or video must be able to predict how much disk space will be required for any given application. Suppose you have taken a dozen pictures with a 3 megapixel digital camera (at high quality setting), and downloaded them to your computer. A list of the files, with their sizes in bytes, is:
806912 IMG_0760.jpg    868352 IMG_0763.jpg    774144 IMG_0766.jpg    815104 IMG_0769.jpg
626688 IMG_0761.jpg    872448 IMG_0764.jpg    806912 IMG_0767.jpg    831488 IMG_0770.jpg
634880 IMG_0762.jpg    884736 IMG_0765.jpg    835584 IMG_0768.jpg    798720 IMG_0771.jpg

Suppose you are preparing to go on vacation; you will be gone for three weeks, and anticipate taking 30 to 40 pictures each day. How much storage space will you need to take with you?
We can characterize a group of numbers by their mean, or average. This is simply the sum of their values divided by how many values there are. In this example, the mean file size is
(806912 + 626688 + 634880 + 868352 + 872448 + 884736 + 774144 + 806912 + 835584 + 815104 + 831488 + 798720) / 12 = 796330.6667 bytes

which we round up to 796331 bytes (since there must be a whole number of bytes). Now we can predict that for 40 pictures per day each day for 21 days, we will need
796331 * 40 * 21 = 668918040 bytes

or approximately 638 MB of storage. Note that none of the files had a size that was equal to the average, but that if we multiply the average by the number of pictures, we will get a pretty good idea of how much storage we need.
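These calculations are easy to reproduce with a short script (a sketch; the twelve file sizes are the values listed above):

```python
import math

# Sizes in bytes of the twelve sample images listed above.
sizes = [806912, 626688, 634880, 868352, 872448, 884736,
         774144, 806912, 835584, 815104, 831488, 798720]

mean = sum(sizes) / len(sizes)   # 796330.6667 bytes
mean_bytes = math.ceil(mean)     # round up to a whole byte: 796331

# Worst case for the trip: 40 pictures per day for 21 days.
needed = mean_bytes * 40 * 21    # 668918040 bytes
print(mean_bytes, needed, needed / 2**20)  # about 638 MB
```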
How good an estimate we get depends on how close our individual data items (the 12 file sizes) are to the average. But how do we quantify that, so that we will know something about how good our estimates are? One way is to compute the range of the data in terms of the mean. The lowest data value was 626688; the highest was 884736. The difference between the mean and the lowest is
796331 - 626688 = 169643 bytes

while the difference between the highest and the mean is
884736 - 796331 = 88405 bytes.

The larger of these differences is called the maximum absolute deviation. The range is then expressed as
796331 ± 169643 bytes,

that is, the mean plus or minus the maximum absolute deviation, since all of the data items fall within this range.
We can express the range as a percentage of the mean by using the maximum percentage deviation, which is equal to the maximum absolute deviation divided by the mean, times 100%:
(169643 / 796331) * 100% = 21.303 %

so that we can write the range as 796331 bytes ± 21.3 %. Note that the maximum percentage deviation has no units: we divided the absolute deviation by the mean, so the units canceled. This result indicates that our estimate of how much storage space we will need should not be off by much more than 20 %, and so a single 1 GB flash card for our camera should be plenty.
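The deviations can be computed the same way (a sketch, reusing the file sizes and rounded mean from above):

```python
sizes = [806912, 626688, 634880, 868352, 872448, 884736,
         774144, 806912, 835584, 815104, 831488, 798720]
mean_bytes = 796331  # rounded-up mean from above

# Maximum absolute deviation: the larger of the two differences
# between the mean and the extreme data values.
max_abs_dev = max(mean_bytes - min(sizes), max(sizes) - mean_bytes)

# Maximum percentage deviation (unitless).
max_pct_dev = max_abs_dev / mean_bytes * 100
print(max_abs_dev, round(max_pct_dev, 3))  # 169643, 21.303
```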
We can often model a set of data using a statistical distribution. The most important of these distributions is the Normal or Gaussian Distribution. It describes data which is randomly distributed about the mean, and is familiar to most students as the "bell curve". Here is the normal distribution corresponding to our data (scaled by a factor of 1000 for readability):
It is symmetric about the mean, and the "ends" asymptotically approach zero. It describes the probability with which any given random value will be included in the data set: values near the mean are much more probable than values far away. The width of the curve is described using the standard deviation, and it is useful to know that approximately 68.27%, or just over two thirds, of the data items will fall in the range described by the mean plus or minus one standard deviation (within the red lines above). Thus the standard deviation gives us a convenient measure of the "spread" of a set of random data items. It can be computed using the formula
σ = ( Σ ( x_{i} - μ )^{2} / n )^{1/2}

where μ is the mean, the x_{i} are the individual data values, and n is the number of data values.
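The 68.27 % figure quoted above is easy to check empirically (a sketch using Python's random module; the seed and sample size are arbitrary choices):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility
mu, sigma = 0.0, 1.0

# Draw 100000 normally distributed values and count how many fall
# within one standard deviation of the mean.
samples = [random.gauss(mu, sigma) for _ in range(100000)]
inside = sum(1 for x in samples if abs(x - mu) <= sigma)
print(inside / len(samples))  # close to 0.6827
```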
For the file sizes in our example, the standard deviation is then
( ( (806912 - 796331)^{2} + (626688 - 796331)^{2} + (634880 - 796331)^{2} + (868352 - 796331)^{2} + (872448 - 796331)^{2} + (884736 - 796331)^{2} + (774144 - 796331)^{2} + (806912 - 796331)^{2} + (835584 - 796331)^{2} + (815104 - 796331)^{2} + (831488 - 796331)^{2} + (798720 - 796331)^{2} ) / 12 )^{1/2} = 80359.9 bytes

Note that the standard deviation, like the maximum absolute deviation, has the same units as the data items. This means that our range was significantly broader than the width of our normal distribution: our estimate of how much storage we need is probably better than we at first thought. We can verify this by examining a histogram of our data: we choose a range encompassing all of our data (600000 to 900000), split it into "bins" (in this case, 15 bins of 20000 each), and plot the number of data items in each bin:
- 620000 to 639999 - 2 data points (626688 and 634880)
- 760000 to 779999 - 1 data point (774144)
- 780000 to 799999 - 1 data point (798720)
- 800000 to 819999 - 3 data points (806912, 806912 and 815104)
- 820000 to 839999 - 2 data points (831488 and 835584)
- 860000 to 879999 - 2 data points (868352 and 872448)
- 880000 to 899999 - 1 data point (884736)

We can see from this that most of our data is clustered together. The two files which had much smaller sizes were images with less detail than the others. This is a characteristic of JPEG (Joint Photographic Experts Group) files: as a lossy compression method, the sizes of the files it generates are closely related to the amount of detail in the original pictures.
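Both the standard deviation and the bin counts can be reproduced with another short sketch (the bin layout follows the 15-bins-of-20000 scheme above):

```python
import statistics

sizes = [806912, 626688, 634880, 868352, 872448, 884736,
         774144, 806912, 835584, 815104, 831488, 798720]

# Population standard deviation (divide by n, as in the formula above).
sigma = statistics.pstdev(sizes)
print(round(sigma, 1))  # approximately 80359.9

# Histogram: 15 bins of width 20000, covering 600000 to 899999.
counts = [0] * 15
for s in sizes:
    counts[(s - 600000) // 20000] += 1
for i, c in enumerate(counts):
    if c:
        print(600000 + 20000 * i, "to", 600000 + 20000 * i + 19999, "-", c)
```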
Our histogram indicates that a Gaussian is really a poor model for our data. Can we do better using a larger data set? Let us define the Gaussian Distribution as
G(x) = M e^{- (x - μ)^{2} / (2 σ^{2})}.

The distribution will then have a maximum value M at the mean μ, with standard deviation σ. We will find that it is a much more effective model if we use a larger sample size. This spreadsheet has three sheets:
In addition, there is a graph showing the binned data together with the Gaussian model.
From the graph, it is obvious that the Gaussian is a poor fit to the data; the population is made up of more than one type of file, but how many distinct sets are there?
The Gaussian seems to be a reasonable model for this set of data, although it is clear that the set of file sizes is not a truly random sample.
The Gaussian function plotted in red is
G(x) = 1150 e^{- (x - 300000)^{2} / (2 * 100000^{2})} + 34 e^{- (x - 2100000)^{2} / (2 * 250000^{2})} + 85 e^{- (x - 4250000)^{2} / (2 * 500000^{2})}.

These three sub-populations roughly correspond to 704x480 pixel images designed for web use, and the images taken with 10 Megapixel and 14 Megapixel cameras, respectively. Of course, as we noted above, there is significant noise in the data set, which we can see because the Gaussians are only approximate fits to the data.
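The three-Gaussian model above can be written directly as a function (a sketch; the amplitudes, means and standard deviations are the values quoted in the text):

```python
import math

def gaussian(x, m, mu, sigma):
    """One Gaussian component with peak value m at mean mu."""
    return m * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

def G(x):
    """Sum of the three sub-population Gaussians from the text."""
    return (gaussian(x, 1150, 300000, 100000) +
            gaussian(x, 34, 2100000, 250000) +
            gaussian(x, 85, 4250000, 500000))

# Near each mean, the corresponding component dominates:
print(round(G(300000)))   # about 1150 (web-sized images)
print(round(G(2100000)))  # about 34  (10 Megapixel images)
print(round(G(4250000)))  # about 85  (14 Megapixel images)
```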
It is clear from these examples that both the population size and the resolution of the binning have a significant impact on the reliability of the model distribution.
We will now conclude this chapter, and the text, with a discussion of Computer Performance Modeling.
©2012, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.