You might want a tolerance interval

A great taste in intervals

While the confidence interval wins the interval popularity sweepstakes, I've seen it frequently misused, especially on blogs, sometimes in webapps, and even occasionally in conference papers, in situations where it's not telling you the right thing. A lot of the introductions to statistical intervals I can find are a bit daunting and highly technical, so hopefully this brief, minimally technical explanation will point someone toward the right keyword, convincing them (you?) that they really might like a tolerance interval instead.

Confidence intervals

Confidence intervals are what you see quoted with things like political polls: with 95% confidence, 57±4% of voters approve of Obama's job performance. The important thing about confidence intervals is that they only measure sampling error. That ±4% is only there because the polling firm called a random subset of Americans. If they managed to contact every single American, the error would be ±0%, because they would have an exact count of how many Americans responded each way to the survey question. So, an important feature: as sample size approaches population size (or infinity, for an infinite population), the confidence interval's size approaches zero, because you no longer have sampling error.
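
To make the "only sampling error" point concrete, here's a quick sketch in R (the numbers are invented, not from any real poll): the usual normal-approximation margin of error for a sampled proportion, with the textbook finite-population correction that drives it to zero once you've asked everyone.

    # 95% margin of error for a sampled proportion, normal approximation,
    # with the finite-population correction (FPC). Illustrative only.
    margin <- function(p_hat, n, N) {
      fpc <- sqrt((N - n) / (N - 1))           # shrinks to 0 as n approaches N
      1.96 * sqrt(p_hat * (1 - p_hat) / n) * fpc
    }

    N <- 250e6              # rough stand-in for the number of American adults
    margin(0.57, 1000, N)   # ~0.03: poll a thousand people, about +/-3 points
    margin(0.57, 1e6,  N)   # ~0.001: poll a million, sampling error nearly gone
    margin(0.57, N,    N)   # 0: ask everyone, and no sampling error is left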

But consider this example I recently ran across (slightly anonymized). Say you've collected some data on traffic times. Should you report it as an average with a confidence interval? You could, but you have to be careful about what that actually means. If we say that a trip between some pair of points takes 44±9 minutes, where the ±9 is a 95% confidence interval, that means something very specific: we have 95% confidence that the average travel time is between 35 and 53 minutes. What it does not tell you, though this particular presentation implied it, is that there is a 95% chance that you, driving that stretch tomorrow, will take between 35 and 53 minutes. That's because your travel time uncertainty isn't due only to sampling error in the data used to make the prediction, but also to real variability, since travel times vary from trip to trip. If we sampled a ton of travel-time data, the confidence interval would eventually collapse to near-zero, because we would have a nearly exact estimate of the average travel time. But we still wouldn't be able to exactly predict how long your specific trip tomorrow will take, because trip times vary.
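
If it helps to see that distinction run, here's a small R sketch with made-up travel times (not the original data set): the confidence interval for the mean collapses as the sample grows, but the spread of individual trips, which is what your drive tomorrow actually depends on, does not.

    set.seed(1)

    # Pretend individual trips are normal: mean 44 minutes, sd 10 minutes.
    ci_for_mean <- function(n) {
      trips <- rnorm(n, mean = 44, sd = 10)
      half  <- qt(0.975, df = n - 1) * sd(trips) / sqrt(n)  # 95% CI half-width
      c(lower = mean(trips) - half, upper = mean(trips) + half)
    }

    ci_for_mean(20)      # fairly wide, something like 40 to 49 minutes
    ci_for_mean(20000)   # hugs 44: sampling error is nearly gone
    # ...but any individual trip still lands roughly within mean +/- 2*sd,
    # somewhere around 24 to 64 minutes, no matter how much we sample.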

Prediction intervals

The next most common interval is probably the prediction interval. This is closer to what we want, and sure sounds like it's the right thing, but it quite possibly isn't what you'll want either, and many uses of it are not quite right.

A prediction interval has the following interpretation. A 95% prediction interval is one where, if you sample some data, construct an interval from that data, and then sample one new data point, there is a 95% chance that the interval will contain that data point. What's crucial here is that the 95% applies across repetitions of that whole procedure. If you have an iterated process, this will have the expected interpretation. You sample some data points, use them to predict a new data point; sample some more, use them for another prediction; and so on. Then 95% of the prediction intervals will contain their paired observation. But any single prediction interval from that iterated series may cover more or less than 95% of future samples; they only cover 95% on average. Thus we only get the 95% prediction rate through this iterative process, which in effect lets us use the "average" interval rather than picking any one interval.
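
A small simulation may make the "on average over repetitions" point concrete. This is a sketch in R using the standard normal-theory prediction interval, mean ± t*s*sqrt(1 + 1/n); the data are simulated, and it's the behavior, not the particular numbers, that matters.

    set.seed(2)
    n <- 15   # a smallish sample each round

    # One round: sample data, build a 95% prediction interval, test one new point.
    one_round <- function() {
      x        <- rnorm(n, mean = 44, sd = 10)
      half     <- qt(0.975, df = n - 1) * sd(x) * sqrt(1 + 1/n)
      interval <- mean(x) + c(-1, 1) * half
      new_obs  <- rnorm(1, mean = 44, sd = 10)
      c(hit      = new_obs >= interval[1] && new_obs <= interval[2],
        # true long-run coverage of *this particular* interval:
        coverage = pnorm(interval[2], 44, 10) - pnorm(interval[1], 44, 10))
    }

    results <- replicate(10000, one_round())
    mean(results["hit", ])                          # ~0.95, as advertised
    quantile(results["coverage", ], c(0.1, 0.9))    # but single intervals vary a lot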

But what if we really want one interval? In the travel-time example, we want to be able to collect some data, then give a single interval bracketing trip times: you'll take between X1 and X2 minutes. Just picking one of the prediction intervals could be quite far off, depending on the sample size and the distribution of data points.

Tolerance intervals

A tolerance interval can be thought of as a prediction interval where we also want to have some confidence that the interval itself is a "good" one, because unlike in the iterated case, we're going to be keeping this one interval and reusing it a lot.

Therefore we now have two inputs: what percentage of the population we want to cover, and how high we want our confidence in the interval itself to be. For example, if we construct a (95%,50%) tolerance interval, this will tell us, with 95% confidence, that at least half of car trips will fall within the interval. The two numbers can be varied independently to choose the desired coverage and confidence.
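
For normally distributed data there's a simple approximate recipe for the factor; here's a sketch in R using the Howe approximation described in the NIST handbook mentioned below. The travel times are simulated, and the normality assumption is mine, purely for illustration.

    # Two-sided normal tolerance factor (Howe's approximation): with confidence
    # `conf`, at least `coverage` of the population lies within mean +/- k * sd.
    tol_factor <- function(n, coverage = 0.50, conf = 0.95) {
      z    <- qnorm((1 + coverage) / 2)
      chi2 <- qchisq(1 - conf, df = n - 1)   # lower chi-square quantile
      sqrt((n - 1) * (1 + 1/n) * z^2 / chi2)
    }

    set.seed(3)
    trips <- rnorm(30, mean = 44, sd = 10)   # pretend travel-time sample
    k <- tol_factor(length(trips), coverage = 0.50, conf = 0.95)
    mean(trips) + c(-1, 1) * k * sd(trips)   # the (95%, 50%) tolerance interval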

This is probably the interval you want to use if you're both: 1) using sampled data to make predictions; and 2) trying to capture the range of probable outcomes, such as the range of travel times a driver could expect.

Unlike with a confidence interval, it has the expected behavior as our sample size approaches infinity. Sampling error goes to zero, but instead of the tolerance interval shrinking to zero (as the confidence interval does), it approaches the population percentiles. If we had a very large amount of data, so that sampling error was negligible, we could find the middle 50% of car trips by just looking at how long the 25th and 75th percentile trips in our data set took. The tolerance interval extends that natural procedure to cases where we don't have huge samples, so we can't necessarily trust that the 25th percentile of our data set is particularly close to the 25th percentile of the population.
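
You can watch that limit happen with the sketch above: as n grows, the (95%, 50%) tolerance factor shrinks toward qnorm(0.75) ≈ 0.674, the factor that marks off a normal population's middle 50%, not toward zero. And with a genuinely huge (simulated) sample you can skip the machinery and just read off percentiles.

    # Reusing tol_factor() from the sketch above, for growing sample sizes:
    sapply(c(10, 30, 100, 1000, 1e5), tol_factor)
    # roughly 1.16, 0.88, 0.77, 0.70, 0.68, approaching 0.674 rather than 0

    # With a very large sample, the percentiles themselves are trustworthy:
    set.seed(4)
    big <- rnorm(1e6, mean = 44, sd = 10)
    quantile(big, c(0.25, 0.75))   # middle 50% of trips, about 37 to 51 minutes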

Further reading

There are a number of ways to actually calculate tolerance intervals, both parametric (e.g. assuming a normal distribution) and non-parametric, none covered here (sorry!). Many statistics packages have the functionality built in; for example, you might use the R package 'tolerance'.
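
If you'd rather not roll your own, something like the following should work with the 'tolerance' package, though I'm going from memory on the interface (alpha here is one minus the confidence, P the coverage), so check the package documentation; the data are again made up.

    # install.packages("tolerance")
    library(tolerance)

    set.seed(5)
    trips <- rnorm(30, mean = 44, sd = 10)

    # Normal-theory two-sided tolerance interval: 95% confidence, 50% coverage.
    normtol.int(trips, alpha = 0.05, P = 0.50, side = 2)

    # Nonparametric version, if you'd rather not assume normality.
    nptol.int(trips, alpha = 0.05, P = 0.50, side = 2)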

A somewhat more technical introduction to these not-the-confidence-interval intervals, also lamenting their underuse, but unfortunately not freely available online, can be found in Stephen B. Vardeman's "What about the other intervals?" (The American Statistician, vol. 46, no. 3, pp. 193–197, 1992). If you really do want a prediction interval instead, a nice overview aimed at statistics educators can be found in Scott Preston's "Teaching Prediction Intervals" (Journal of Statistics Education, vol. 8, no. 3, 2000). An explanation of how to compute tolerance intervals for a normal distribution can be found in the NIST Engineering Statistics Handbook.

Finally, an old but quite readable and practical explanation is in the 1960 textbook Statistics Manual, which can now be had for pennies. Written by three researchers at the U.S. Naval Ordnance Test Station, it points out the important difference between having a 99% confidence interval for where the average bomb will fall, versus having a bound on where 99% of bombs will fall!