Introduction to Descriptive Statistics

So far in this chapter we have been concerned with computing "snapshot" quantities: numbers that describe a single device, process or file. But it is often vitally important to give meaning to groups of numbers.

In the last section, we illustrated how multimedia applications require large-capacity disk drives for data storage. Anyone working with digital photographs, music or video must be able to predict how much disk space will be required for any given application. Suppose you have taken a dozen pictures with a 3 megapixel digital camera (at its high-quality setting) and downloaded them to your computer. A list of the files, with their sizes in bytes, is:

806912 IMG_0760.jpg  868352 IMG_0763.jpg  774144 IMG_0766.jpg  815104 IMG_0769.jpg
626688 IMG_0761.jpg  872448 IMG_0764.jpg  806912 IMG_0767.jpg  831488 IMG_0770.jpg
634880 IMG_0762.jpg  884736 IMG_0765.jpg  835584 IMG_0768.jpg  798720 IMG_0771.jpg
Suppose you are preparing to go on vacation; you will be gone for three weeks, and anticipate taking 30 to 40 pictures each day. How much storage space will you need to take with you?

We can characterize a group of numbers by their mean, or average. This is simply the sum of their values divided by how many values there are. In this example, the mean file size is

(806912 + 626688 + 634880 + 868352 + 872448 + 884736 + 774144 + 806912 + 835584 + 815104 + 831488 + 798720) / 12
= 796330.6667 bytes
which we round up to 796331 bytes (since a file must occupy a whole number of bytes). Now we can predict that for 40 pictures per day for 21 days, we will need
796331 * 40 * 21 = 668918040 bytes
or approximately 638 MB of storage. Note that none of the files has a size equal to the average, but if we multiply the average by the number of pictures, we get a good estimate of how much storage we need.
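For readers who want to check the arithmetic in code, here is a minimal Python sketch of the same calculation, using the twelve file sizes listed above:

```python
# Sizes in bytes of the twelve images listed above.
sizes = [806912, 626688, 634880, 868352, 872448, 884736,
         774144, 806912, 835584, 815104, 831488, 798720]

mean = round(sum(sizes) / len(sizes))   # 796330.67 rounds to 796331 bytes

# 40 pictures per day for 21 days.
needed = mean * 40 * 21                 # 668918040 bytes
print(mean, needed, needed / 2**20)     # about 638 MB
```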

How good an estimate it is depends on how close the individual data items (the 12 file sizes) are to the average. But how do we quantify that, so that we know something about how good our estimates are? One way is to express the range of the data in terms of the mean. The lowest data value is 626688; the highest is 884736. The difference between the mean and the lowest is

796331 - 626688 = 169643 bytes
while the difference between the highest and the mean is
884736 - 796331 = 88405 bytes.
The larger of these differences is called the maximum absolute deviation. The range is then expressed as
796331 ± 169643 bytes,
that is, the mean plus or minus the maximum absolute deviation, since all of the data items fall within this range.
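Continuing the Python sketch above (which defined sizes and the rounded mean), the maximum absolute deviation is simply:

```python
# `sizes` and `mean` come from the previous sketch.
low, high = min(sizes), max(sizes)          # 626688 and 884736 bytes
max_abs_dev = max(mean - low, high - mean)  # max(169643, 88405) = 169643
print(f"range: {mean} ± {max_abs_dev} bytes")
```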

We can express the range as a percentage of the mean by using the maximum percentage deviation, which is equal to the maximum absolute deviation divided by the mean, times 100%:

(169643 / 796331) * 100%
= 21.303 %
so that we can write the range as 796331 bytes ± 21.3%. Note that the maximum percentage deviation has no units: we divided the absolute deviation by the mean, so the units cancel. This result indicates that our estimate of how much storage space we will need should not be off by much more than 20%, and so a single 1 GB flash card for our camera should be plenty.
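In the same Python sketch, the percentage deviation and a worst-case storage check might look like this:

```python
# `mean`, `max_abs_dev` and `needed` come from the previous sketches.
max_pct_dev = 100 * max_abs_dev / mean           # about 21.3 %
print(f"range: {mean} ± {max_pct_dev:.1f} %")

# Worst case: the 638 MB estimate plus 21.3 % is still well under 1 GB.
print(needed * (1 + max_pct_dev / 100) / 2**20)  # about 774 MB
```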

Distributions

There are several well-known functions which can help us to organize our data and make additional predictions. These functions are called distributions because we will use them as models of how various kinds of data are distributed.

The most important of these distributions is the Normal or Gaussian Distribution. It describes data which is randomly distributed about the mean, and is familiar to most students as the "bell curve". Here is the normal distribution corresponding to our data (scaled by a factor of 1000 for readability):

It is symmetric about the mean, and the "ends" asymptotically approach zero. It describes the probability with which any given random value will be included in the data set: values near the mean are much more probable than values far away. The width of the curve is described using the standard deviation, and it is useful to know that approximately 68.27%, or just over two thirds, of the data items will fall in the range described by the mean plus or minus one standard deviation (within the red lines above). Thus the standard deviation gives us a convenient measure of the "spread" of a set of random data items. It can be computed using the formula

σ = ( Σ ( xᵢ - μ )² / n )^(1/2)
where σ is the standard deviation, the xᵢ are the individual data values, μ is the mean and n is the number of data values.

For the file sizes in our example, the standard deviation is then

( ( (806912 - 796331)² + (626688 - 796331)² + (634880 - 796331)² + (868352 - 796331)² + (872448 - 796331)² + (884736 - 796331)² + (774144 - 796331)² + (806912 - 796331)² + (835584 - 796331)² + (815104 - 796331)² + (831488 - 796331)² + (798720 - 796331)² ) / 12 )^(1/2)
= 80359.9 bytes
Note that the standard deviation, like the maximum absolute deviation, has the same units as the data items.
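The same computation in our running Python sketch (using the rounded mean, as in the arithmetic above):

```python
# `sizes` and `mean` (796331) come from the previous sketches.
variance = sum((x - mean) ** 2 for x in sizes) / len(sizes)
sigma = variance ** 0.5
print(sigma)    # about 80359.9 bytes
```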
This means that our range was significantly broader than the width of our normal distribution: our estimate of how much storage we need is probably better than we at first thought. We can verify this by examining a histogram of our data: we choose a range encompassing all of our data (600000 to 900000), split it into "bins" (in this case, 15 bins of 20000 bytes each), and plot the number of data items in each bin:
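A minimal sketch of the binning itself (the plotting is left to whatever charting tool you prefer):

```python
# `sizes` comes from the previous sketches; 15 bins of 20000 bytes spanning 600000-900000.
# The integer division works because every size is at least 600000 and below 900000.
counts = [0] * 15
for x in sizes:
    counts[(x - 600000) // 20000] += 1
print(counts)   # most of the sizes fall into a few adjacent bins near the mean
```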

We can see from this that most of our data is clustered together. The two files which had much smaller sizes were images with less detail than the others. This is a characteristic of JPEG (Joint Photographic Experts Group) files: because JPEG is a lossy compression method, the sizes of the files it generates are closely related to the amount of detail in the original pictures.

Our histogram indicates that a Gaussian is really a poor model for our data. Can we do better using a larger data set? Let us define the Gaussian Distribution as

G(x) = M e^( -(x - μ)² / (2σ²) ).
The distribution then has a maximum value M at the mean μ, with standard deviation σ. We will find that it is a much more effective model if we use a larger sample size. The accompanying spreadsheet has three sheets:

  1. "1% Data Set", which contains a random sample of JPEG file sizes from the author's collection. The entire population consists of 6342 files; this sheet contains the sizes of 63 of those. It is organized by column:

    1. the file sizes;
    2. some labels;
    3. the maximum, mean and standard deviation of the sample sizes;
    4. a list of the upper limit of each of 100 bins (the first is for files ≤ 50,000 bytes long, the second is for files with sizes > 50,000 and ≤ 100,000 bytes, etc.);
    5. the frequency data (how many of the sizes in column A fit in each bin); and
    6. a Gaussian distribution with the parameters from column C.

    In addition, there is a graph showing the binned frequency data together with the Gaussian distribution.

    From the graph, it is obvious that the Gaussian is a poor fit to the data; the population is made up of more than one type of file, but how many distinct sets are there?

  2. "Small Population", which contains the sizes of the 4020 files from the full data set with sizes below 1,000,000 bytes (the structure of this sheet is the same as the first sheet, except that the bin sizes are 10,000 bytes):

    The Gaussian seems to be a reasonable model for this set of data, although it is clear that the set of file sizes is not a truly random sample.

  3. "Full Data Set", which contains all 6342 file sizes (the structure of this sheet is the same as the first sheet). An approximate fit to the data shows that this is a multimodal distribution: it contains multiple sub-populations with different means and standard deviations:

    The Gaussian function plotted in red is

    G(x) = 1150 e^( -(x - 300000)² / (2 * 100000²) ) + 34 e^( -(x - 2100000)² / (2 * 250000²) ) + 85 e^( -(x - 4250000)² / (2 * 500000²) ).
    These three sub-populations roughly correspond to 704x480 pixel images designed for web use, and the images taken with 10 Megapixel and 14 Megapixel cameras, respectively. Of course, as we noted above, there is significant noise in the data set, which we can see because the Gaussians are only approximate fits to the data.
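To make the fit concrete, here is a short Python sketch that implements the Gaussian G(x) defined earlier in this section and sums the three components, using the amplitudes, means and standard deviations quoted above (the fit itself is only approximate, as noted):

```python
import math

def gaussian(x, peak, mu, sigma):
    # G(x) = M e^( -(x - mu)^2 / (2 sigma^2) ), with maximum value `peak` at the mean.
    return peak * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

def mixture(x):
    # The three sub-populations fitted to the full data set above.
    return (gaussian(x, 1150,  300000, 100000) +
            gaussian(x,   34, 2100000, 250000) +
            gaussian(x,   85, 4250000, 500000))

print(mixture(300000))   # about 1150: the model's height at the first peak
```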

It is clear from these examples that both the population size and the resolution of the binning have a significant impact on the reliability of the model distribution.

We will now conclude this chapter, and the text, with a discussion of Computer Performance Modeling.



©2012, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.
