Measurement Validity Types
There’s an awful lot of confusion in the methodological literature that stems from the wide variety of labels that are used to describe the validity of measures. I want to make two cases here. First, it’s dumb to limit our scope only to the validity of measures. We really want to talk about the validity of any operationalization. That is, any time you translate a concept or construct into a functioning and operating reality (the operationalization), you need to be concerned about how well you did the translation. This issue is as relevant when we are talking about treatments or programs as it is when we are talking about measures. (In fact, come to think of it, we could also think of sampling in this way. The population of interest in your study is the “construct” and the sample is your operationalization. If we think of it this way, we are essentially talking about the construct validity of the sampling!) Second, I want to use the term construct validity to refer to the general case of translating any construct into an operationalization. Let’s use all of the other validity terms to reflect different ways you can demonstrate different aspects of construct validity.
With all that in mind, here’s a list of the validity types that are typically mentioned in texts and research papers when talking about the quality of measurement:
- Construct validity
  - Translation validity
    - Face validity
    - Content validity
  - Criterion-related validity
    - Predictive validity
    - Concurrent validity
    - Convergent validity
    - Discriminant validity
I have to warn you here that I made this list up. I’ve never heard of “translation” validity before, but I needed a good name to summarize what both face and content validity are getting at, and that one seemed sensible. All of the other labels are commonly known, but the way I’ve organized them is different than I’ve seen elsewhere.
Let’s see if we can make some sense out of this list. First, as mentioned above, I would like to use the term construct validity to be the overarching category. Construct validity is the approximate truth of the conclusion that your operationalization accurately reflects its construct. All of the other terms address this general issue in different ways. Second, I make a distinction between two broad types: translation validity and criterion-related validity. That’s because I think these correspond to the two major ways you can assure or assess the validity of an operationalization. In translation validity, you focus on whether the operationalization is a good reflection of the construct. This approach is definitional in nature – it assumes you have a good detailed definition of the construct and that you can check the operationalization against it. In criterion-related validity, you examine whether the operationalization behaves the way it should given your theory of the construct. This is a more relational approach to construct validity. It assumes that your operationalization should function in predictable ways in relation to other operationalizations based upon your theory of the construct. (If all this seems a bit dense, hang in there until you’ve gone through the discussion below – then come back and re-read this paragraph.) Let’s go through the specific validity types.
I just made the term translation validity up today! (See how easy it is to be a methodologist?) I needed a term that described what both face and content validity are getting at. In essence, both of those validity types are attempting to assess the degree to which you accurately translated your construct into the operationalization, and hence the choice of name. Let’s look at the two types of translation validity.
In face validity, you look at the operationalization and see whether “on its face” it seems like a good translation of the construct. This is probably the weakest way to try to demonstrate construct validity. For instance, you might look at a measure of math ability, read through the questions, and decide that yep, it seems like this is a good measure of math ability (i.e., the label “math ability” seems appropriate for this measure). Or, you might observe a teenage pregnancy prevention program and conclude that, “Yep, this is indeed a teenage pregnancy prevention program.” Of course, if this is all you do to assess face validity, it would clearly be weak evidence because it is essentially a subjective judgment call. (Note that just because it is weak evidence doesn’t mean that it is wrong. We need to rely on our subjective judgment throughout the research process. It’s just that this form of judgment won’t be very convincing to others.) We can improve the quality of face validity assessment considerably by making it more systematic. For instance, if you are trying to assess the face validity of a math ability measure, it would be more convincing if you sent the test to a carefully selected sample of experts on math ability testing and they all reported back with the judgment that your measure appears to be a good measure of math ability.
In content validity, you essentially check the operationalization against the relevant content domain for the construct. This approach assumes that you have a good detailed description of the content domain, something that’s not always true. For instance, we might lay out all of the criteria that should be met in a program that claims to be a “teenage pregnancy prevention program.” We would probably include in this domain specification the definition of the target group, criteria for deciding whether the program is preventive in nature (as opposed to treatment-oriented), and lots of criteria that spell out the content that should be included like basic information on pregnancy, the use of abstinence, birth control methods, and so on. Then, armed with these criteria, we could use them as a type of checklist when examining our program. Only programs that meet the criteria can legitimately be defined as “teenage pregnancy prevention programs.” This all sounds fairly straightforward, and for many operationalizations it will be. But for other constructs (e.g., self-esteem, intelligence), it will not be easy to decide on the criteria that constitute the content domain.
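The checklist idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the criterion names and the two example programs are made up for the sketch, and a real content domain would need far more detailed criteria agreed on by people who know the area.

```python
# Hypothetical content-domain criteria for a "teenage pregnancy prevention
# program". These names are illustrative assumptions, not an official standard.
REQUIRED_CRITERIA = {
    "targets_teenagers",
    "preventive_not_treatment",
    "covers_pregnancy_basics",
    "covers_abstinence",
    "covers_birth_control",
}


def missing_criteria(program_features):
    """Return the required criteria the program does not meet, sorted by name."""
    return sorted(REQUIRED_CRITERIA - set(program_features))


def meets_content_domain(program_features):
    """A program passes only if every required criterion is present."""
    return not missing_criteria(program_features)


# Two made-up programs, described by the features an observer recorded.
program_a = {
    "targets_teenagers",
    "preventive_not_treatment",
    "covers_pregnancy_basics",
    "covers_abstinence",
    "covers_birth_control",
    "has_peer_mentors",  # extra features do not count against the program
}
program_b = {"targets_teenagers", "covers_pregnancy_basics"}

print(meets_content_domain(program_a))  # True
print(missing_criteria(program_b))
# ['covers_abstinence', 'covers_birth_control', 'preventive_not_treatment']
```

Notice that the hard part is not the code but the set of criteria at the top. For constructs like self-esteem, there may be no agreed list to write down in the first place.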
In criterion-related validity, you check the performance of your operationalization against some criterion. How is this different from content validity? In content validity, the criteria are the construct definition itself – it is a direct comparison. In criterion-related validity, we usually make a prediction about how the operationalization will perform based on our theory of the construct. The differences among the criterion-related validity types are in the criteria they use as the standard for judgment.
In predictive validity, we assess the operationalization’s ability to predict something it should theoretically be able to predict. For instance, we might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession. We could give our measure to experienced engineers and see if there is a high correlation between scores on the measure and their salaries as engineers. A high correlation would provide evidence for predictive validity – it would show that our measure can correctly predict something that we theoretically think it should be able to predict.
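Here is a rough sketch of the correlation check described above, using only the Python standard library. The scores and salaries are invented for illustration, and `pearson_r` is a helper written for this sketch (the ordinary Pearson correlation coefficient), not part of any particular package.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Made-up data: math ability scores and salaries (in thousands) for six engineers.
math_scores = [52, 61, 67, 73, 80, 88]
salaries = [61, 66, 72, 75, 83, 90]

r = pearson_r(math_scores, salaries)
print(f"correlation between score and salary: {r:.2f}")
```

A high positive value of `r` would count as evidence for predictive validity. How high is "high enough" is a judgment call that depends on the theory and the field, so the sketch does not hard-code a cutoff.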
In concurrent validity, we assess the operationalization’s ability to distinguish between groups that it should theoretically be able to distinguish between. For example, if we come up with a way of assessing manic-depression, our measure should be able to distinguish between people who are diagnosed with manic-depression and those diagnosed with paranoid schizophrenia. If we want to assess the concurrent validity of a new measure of empowerment, we might give the measure to both migrant farm workers and to the farm owners, theorizing that our measure should show that the farm owners are higher in empowerment. As in any discriminating test, the results are more powerful if you are able to show that you can discriminate between two groups that are very similar.
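The empowerment example can be sketched as a simple comparison of two groups. The scores below are invented, and the use of Cohen's d (a standardized mean difference) is my own choice for the sketch. The text above only says the measure should separate the groups; it does not prescribe a statistic.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b), pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up empowerment scores for the two groups in the example.
farm_owners = [78, 82, 75, 88, 80]
farm_workers = [55, 61, 58, 49, 63]

mean_difference = mean(farm_owners) - mean(farm_workers)
d = cohens_d(farm_owners, farm_workers)

print(f"mean difference (owners - workers): {mean_difference:.1f}")
print(f"standardized difference (Cohen's d): {d:.2f}")
```

If the theory says owners should score higher, a clearly positive difference supports concurrent validity. With real data you would also want a significance test and a much larger sample than five people per group.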
In convergent validity, we examine the degree to which the operationalization is similar to (converges on) other operationalizations that it theoretically should be similar to. For instance, to show the convergent validity of a Head Start program, we might gather evidence that shows that the program is similar to other Head Start programs. Or, to show the convergent validity of a test of arithmetic skills, we might correlate the scores on our test with scores on other tests that purport to measure basic math ability, where high correlations would be evidence of convergent validity.
In discriminant validity, we examine the degree to which the operationalization is not similar to (diverges from) other operationalizations that it theoretically should not be similar to. For instance, to show the discriminant validity of a Head Start program, we might gather evidence that shows that the program is not similar to other early childhood programs that don’t label themselves as Head Start programs. Or, to show the discriminant validity of a test of arithmetic skills, we might correlate the scores on our test with scores on tests of verbal ability, where low correlations would be evidence of discriminant validity.
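The arithmetic test examples for convergent and discriminant validity come down to comparing two correlations, which can be sketched as follows. All three sets of scores are made up, and the `pearson_r` helper is written just for this sketch.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Made-up scores for the same six people on three tests.
new_arithmetic_test = [12, 15, 9, 20, 17, 11]
other_math_test = [14, 16, 10, 19, 18, 12]   # should be similar in theory
verbal_ability_test = [14, 17, 15, 16, 13, 15]  # should not be similar in theory

convergent_r = pearson_r(new_arithmetic_test, other_math_test)
discriminant_r = pearson_r(new_arithmetic_test, verbal_ability_test)

print(f"convergent (arithmetic vs. other math test): {convergent_r:.2f}")
print(f"discriminant (arithmetic vs. verbal test):   {discriminant_r:.2f}")
```

The pattern is what matters: the correlation with the other math test should be high, and the correlation with the verbal test should be clearly lower. Neither number means much on its own.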