Theory of Reliability
What is reliability? We hear the term used a lot in research contexts, but what does it really mean? If you think about how we use the word “reliable” in everyday language, you might get a hint. For instance, we often speak about a machine as reliable: “I have a reliable car.” Or, news people talk about a “usually reliable source”. In both cases, the word reliable usually means “dependable” or “trustworthy.” In research, the term “reliable” also means dependable in a general sense, but that’s not a precise enough definition. What does it mean to have a dependable measure or observation in a research context? The reason “dependable” is not a good enough description is that it can be confused too easily with the idea of a valid measure (see Measurement Validity). Certainly, when we speak of a dependable measure, we mean one that is both reliable and valid. So we have to be a little more precise when we try to define reliability.
In research, the term reliability means “repeatability” or “consistency”. A measure is considered reliable if it would give us the same result over and over again (assuming that what we are measuring isn’t changing!).
Let’s explore in more detail what it means to say that a measure is “repeatable” or “consistent”. We’ll begin by defining a measure that we’ll arbitrarily label X. It might be a person’s score on a math achievement test or a measure of severity of illness. It is the value (numerical or otherwise) that we observe in our study. Now, to see how repeatable or consistent an observation is, we can measure it twice. We’ll use subscripts to indicate the first and second observations of the same measure. If we assume that what we’re measuring doesn’t change between the time of our first and second observation, we can begin to understand how we get at reliability. While we observe a score for what we’re measuring, we usually think of that score as consisting of two parts, the ‘true’ score or actual level for the person on that measure, and the ‘error’ in measuring it (see True Score Theory). In symbols, X = T + e, so our two observations are X1 = T + e1 and X2 = T + e2.
It’s important to keep in mind that we observe the X score – we never actually see the true (T) or error (e) scores. For instance, a student may get a score of 85 on a math achievement test. That’s the score we observe, an X of 85. But the reality might be that the student is actually better at math than that score indicates. Let’s say the student’s true math ability is 89 (i.e., T=89). That means that the error for that student is -4. What does this mean? Well, while the student’s true math ability may be 89, they may have had a bad day, may not have had breakfast, may have had an argument, or may have been distracted while taking the test. Factors like these can contribute to errors in measurement that make the student’s observed ability appear lower than their true or actual ability.
OK, back to reliability. If our measure, X, is reliable, we should find that if we measure or observe it twice on the same persons, the scores are pretty much the same. But why would they be the same? If you write the two observations as X1 = T + e1 and X2 = T + e2, you should see that the only thing that the two observations have in common is their true score, T. How do you know that? Because the error scores (e1 and e2) have different subscripts indicating that they are different values. But the true score symbol T is the same for both observations. What does this mean? That the two observed scores, X1 and X2, are related only to the degree that the observations share a true score. You should remember that the error score is assumed to be random. Sometimes errors will lead you to perform better on a test than your true ability (e.g., you had a good day guessing!) while other times they will lead you to score worse. But the true score – your true ability on that measure – would be the same on both observations (assuming, of course, that your true ability didn’t change between the two measurement occasions).
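To make the shared-true-score idea concrete, here is a minimal sketch in Python. It assumes the errors are drawn from a normal distribution centered on zero; the error standard deviation of 4 points and the function name `observe` are illustrative choices, not part of the theory.

```python
import random

def observe(true_score, error_sd, rng):
    """Return one observed score X = T + e, with e drawn at random."""
    error = rng.gauss(0.0, error_sd)  # random error centered on zero (assumption)
    return true_score + error

rng = random.Random(0)   # fixed seed so the sketch is repeatable
T = 89.0                 # the student's true score from the example above

X1 = observe(T, 4.0, rng)  # first observation:  X1 = T + e1
X2 = observe(T, 4.0, rng)  # second observation: X2 = T + e2

e1 = X1 - T  # only knowable here because the simulation fixes T
e2 = X2 - T

# T cancels out of the difference: any gap between X1 and X2 is pure error.
print(f"X1={X1:.1f}  X2={X2:.1f}  X1-X2={X1 - X2:.1f}  e1-e2={e1 - e2:.1f}")
```

In a real study we would see only X1 and X2; the simulation can report e1 and e2 only because it set T itself.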
With this in mind, we can now define reliability more precisely. Reliability is a ratio or fraction. In layperson terms we might define this ratio as:
reliability = (the variance of the true score) / (the variance of the measure)

or, in symbols:

reliability = var(T) / var(X)
We’re getting to the critical part now. If you look at the equation above, you should recognize that we can easily determine or calculate the bottom part of the reliability ratio – it’s just the variance of the set of scores we observed. (You remember how to calculate the variance, don’t you? It’s just the sum of the squared deviations of the scores from their mean, divided by the number of scores.) But how do we calculate the variance of the true scores? We can’t see the true scores (we only see X)! Only God knows the true score for a specific observation. And, if we can’t calculate the variance of the true scores, we can’t compute our ratio, which means we can’t compute reliability! Everybody got that? The bottom line is…
we can’t compute reliability because we can’t calculate the variance of the true scores
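The one piece of the ratio we can compute, the variance of the observed scores, can be sketched in a few lines of Python. The scores below are hypothetical, and the function divides by the number of scores, as described above.

```python
def variance(scores):
    """Sum of squared deviations from the mean, divided by the number of scores."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / len(scores)

observed = [85, 90, 78, 92, 80]  # hypothetical observed X scores (mean is 85)
var_x = variance(observed)
print(var_x)  # 29.6
```

Nothing comparable can be written for the true scores, because there is no list of T values to pass in.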
Great. So where does that leave us? If we can’t compute reliability, perhaps the best we can do is to estimate it. Maybe we can get an estimate of the variability of the true scores. How do we do that? Remember our two observations, X1 and X2? We assume (using true score theory) that these two observations would be related to each other to the degree that they share true scores. So, let’s calculate the correlation between X1 and X2. Here’s a simple formula for the correlation:
correlation(X1, X2) = covariance(X1, X2) / (sd(X1) * sd(X2))
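Why should this correlation say anything about the variance of the true scores? Here is a short derivation. It assumes, as true score theory does, that the errors are random, so they are uncorrelated with the true score and with each other. It also assumes that both observations have the same variance, var(X).

```latex
\begin{align*}
\operatorname{cov}(X_1, X_2)
  &= \operatorname{cov}(T + e_1,\; T + e_2) \\
  &= \operatorname{var}(T) + \operatorname{cov}(T, e_2)
     + \operatorname{cov}(e_1, T) + \operatorname{cov}(e_1, e_2) \\
  &= \operatorname{var}(T), \\[4pt]
\operatorname{sd}(X_1)\,\operatorname{sd}(X_2)
  &= \operatorname{var}(X), \\[4pt]
\operatorname{corr}(X_1, X_2)
  &= \frac{\operatorname{cov}(X_1, X_2)}{\operatorname{sd}(X_1)\,\operatorname{sd}(X_2)}
   = \frac{\operatorname{var}(T)}{\operatorname{var}(X)}.
\end{align*}
```

Under those assumptions the covariance on top equals var(T) and the product of standard deviations on the bottom equals var(X), which is exactly the reliability ratio.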
It’s time to reach some conclusions. We know from this discussion that we cannot calculate reliability because we cannot measure the true score component of an observation. But we also know that we can estimate the true score component as the covariance between two observations of the same measure. With that in mind, we can estimate the reliability as the correlation between two observations of the same measure. It turns out that there are several ways we can estimate this reliability correlation. These are discussed in Types of Reliability.
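A small simulation can illustrate this conclusion. In the sketch below we play the role of someone who does know the true scores, so we can compare the ratio var(T)/var(X) with the correlation between two observations. The population values (true scores with a standard deviation of 10, errors with a standard deviation of 5, normally distributed) and the helper names are assumptions made up for the example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Average product of paired deviations from the two means."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def correlation(a, b):
    """Covariance divided by the product of the standard deviations."""
    sd_a = covariance(a, a) ** 0.5
    sd_b = covariance(b, b) ** 0.5
    return covariance(a, b) / (sd_a * sd_b)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 5000                 # number of simulated persons

# Assumed population: var(T) = 100 and var(e) = 25,
# so the theoretical reliability is 100 / (100 + 25) = 0.8.
T = [rng.gauss(70.0, 10.0) for _ in range(n)]
X1 = [t + rng.gauss(0.0, 5.0) for t in T]  # first observation
X2 = [t + rng.gauss(0.0, 5.0) for t in T]  # second observation

true_reliability = covariance(T, T) / covariance(X1, X1)  # needs T: simulation only
estimated_reliability = correlation(X1, X2)               # needs only observed scores

print(f"var(T)/var(X) = {true_reliability:.3f}")
print(f"corr(X1, X2)  = {estimated_reliability:.3f}")
```

The two printed values should land close to each other and close to 0.8, even though the second one never touches the true scores.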
There’s only one other issue I want to address here. How big is an estimate of reliability? To figure this out, let’s go back to the equation given earlier:
reliability = var(T) / var(X)
and remember that because X = T + e (with random error that is unrelated to the true score), we can substitute in the bottom of the ratio:
reliability = var(T) / (var(T) + var(e))
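One way to get a feel for how big the ratio var(T) / (var(T) + var(e)) can be is to plug in a few values. The sketch below is only an illustration; the function name `reliability` and the variance values are made up for the example.

```python
def reliability(var_true, var_error):
    """Reliability ratio var(T) / (var(T) + var(e))."""
    return var_true / (var_true + var_error)

print(reliability(100.0, 0.0))   # no error variance at all     -> 1.0
print(reliability(100.0, 25.0))  # some error variance          -> 0.8
print(reliability(0.0, 25.0))    # no true score variance       -> 0.0
```

Because variances cannot be negative, the ratio stays between 0 and 1.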