Multiple Group Threats
The Central Issue
A multiple-group design typically involves at least two groups and before-after measurement. Most often, one group receives the program or treatment while the other does not and constitutes the “control” or comparison group. But sometimes one group gets the program and the other gets either the standard program or another program you would like to compare. In this case, you would be comparing two programs for their relative outcomes. Typically you would construct a multiple-group design so that you could compare the groups directly. In such designs, the key internal validity issue is the degree to which the groups are comparable before the study. If they are comparable, and the only difference between them is the program, posttest differences can be attributed to the program. But that’s a big if. If the groups aren’t comparable to begin with, you won’t know how much of the outcome to attribute to your program or to the initial differences between groups.
There really is only one multiple-group threat to internal validity: that the groups were not comparable before the study. We call this threat a selection bias or selection threat. A selection threat is any factor other than the program that leads to posttest differences between groups. Whenever we suspect that outcomes differ between groups not because of our program but because of prior group differences, we are suspecting a selection bias. Although the term ‘selection bias’ is used as the general category for all prior differences, when we know specifically what the group difference is, we usually hyphenate it with the ‘selection’ term. The multiple-group selection threats directly parallel the single-group threats. For instance, while we have ‘history’ as a single-group threat, we have ‘selection-history’ as its multiple-group analogue.
As with the single group threats to internal validity, we’ll assume a simple example involving a new compensatory mathematics tutoring program for first graders. The design will be a pretest-posttest design, and we will divide the first graders into two groups, one getting the new tutoring program and the other not getting it.
Here are the major multiple-group threats to internal validity for this case:
A selection-history threat is any event other than the program that occurs between pretest and posttest and that the groups experience differently. Because this is a selection threat, it means the groups differ in some way. Because it’s a ‘history’ threat, it means that the way the groups differ is with respect to their reactions to history events. For example, what if the children in one group differ from those in the other in their television habits? Perhaps the program group children watch Sesame Street more frequently than those in the control group do. Since Sesame Street is a children’s show that presents simple mathematical concepts in interesting ways, a higher average posttest math score for the program group may not indicate the effect of our math tutoring. It may really be an effect of the two groups differentially experiencing a relevant event – in this case Sesame Street – between the pretest and posttest.
A selection-maturation threat results from differential rates of normal growth between pretest and posttest for the groups. In this case, the two groups differ in their rates of maturation with respect to math concepts. It’s important to distinguish between history and maturation threats. In general, history refers to a discrete event or series of events, whereas maturation implies the normal, ongoing developmental process that would take place anyway. In either case, if the groups are maturing at different rates with respect to the outcome, we cannot assume that posttest differences are due to our program – they may be selection-maturation effects.
A selection-testing threat occurs when taking the pretest affects posttest performance differently in the two groups. Perhaps the test “primed” the children in each group differently, or they may have learned differentially from the pretest. In these cases, an observed posttest difference can’t be attributed to the program; it could be the result of selection-testing.
Selection-instrumentation refers to any differential change in the test used for each group from pretest to posttest. In other words, the test changes differently for the two groups. Perhaps the test consists of observers who rate the class performance of the children. What if the program group observers, for example, get better at doing the observations while, over time, the comparison group observers get fatigued and bored? Differences on the posttest could easily be due to this differential instrumentation – selection-instrumentation – and not to the program.
Selection-mortality arises when there is differential nonrandom dropout between pretest and posttest. In our example, different types of children might drop out of each group, or more may drop out of one than the other. Posttest differences might then be due to the different types of dropouts – the selection-mortality – and not to the program.
Finally, selection-regression occurs when there are different rates of regression to the mean in the two groups. This might happen if one group is more extreme on the pretest than the other. In the context of our example, it may be that the program group is getting a disproportionate number of low math ability children because teachers think they need the math tutoring more (and the teachers don’t understand the need for ‘comparable’ program and comparison groups!). Since the tutoring group has the more extreme lower scorers, their mean will regress a greater distance toward the overall population mean and they will appear to gain more than their comparison group counterparts. This is not a real program gain – it’s just a selection-regression artifact.
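To make the selection-regression artifact concrete, here is a minimal simulation sketch in Python. It is only an illustration under stated assumptions, none of which come from the example above: each child has a stable math ability, every test score is that ability plus independent measurement error, the program has no effect at all, and teachers place the lowest-scoring 30 percent of children (by pretest) in the tutoring group. The function name and all numeric values are hypothetical.

```python
import random
from statistics import mean


def mean_gain(group):
    """Average posttest-minus-pretest gain for a list of (pre, post) pairs."""
    return mean(post - pre for pre, post in group)


def simulate_selection_regression(n=2000, seed=0):
    """Simulate a tutoring study in which the program has NO real effect.

    Assumptions (hypothetical, for illustration only):
      - true ability ~ Normal(50, 10)
      - each test score = ability + independent error ~ Normal(0, 8)
      - the lowest-scoring 30% on the pretest go to the program group
    """
    rng = random.Random(seed)
    children = []
    for _ in range(n):
        ability = rng.gauss(50, 10)
        pretest = ability + rng.gauss(0, 8)
        posttest = ability + rng.gauss(0, 8)  # no program effect added
        children.append((pretest, posttest))

    # Nonrandom assignment: sort by pretest, lowest scorers get tutoring.
    children.sort(key=lambda child: child[0])
    cut = int(0.3 * n)
    program, comparison = children[:cut], children[cut:]
    return mean_gain(program), mean_gain(comparison)


program_gain, comparison_gain = simulate_selection_regression()
print(f"program group mean gain:    {program_gain:+.2f}")
print(f"comparison group mean gain: {comparison_gain:+.2f}")
```

Under these assumptions the tutoring group shows a positive average gain and the comparison group a small average loss, even though no child's ability changed. The apparent "gain" comes entirely from the extreme pretest scores drifting back toward the population mean.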
When we move from a single group to a multiple group study, what do we gain from the rather significant investment in a second group? If the second group is a control group and is comparable to the program group, we can rule out the single group threats to internal validity because they will all be reflected in the comparison group and cannot explain why posttest group differences would occur. But the key is that the groups must be comparable. How can we possibly hope to create two groups that are truly “comparable”? The only way we know of doing that is to randomly assign persons in our sample into the two groups – we conduct a randomized or “true” experiment. But in many applied research settings we can’t randomly assign, either because of logistical or ethical factors. In that case, we typically try to assign two groups nonrandomly so that they are as equivalent as we can make them. We might, for instance, have one classroom of first graders assigned to the math tutoring program while the other class is the comparison group. In this case, we would hope the two are equivalent, and we may even have reasons to believe that they are. But because they may not be equivalent and because we did not use a procedure like random assignment to at least assure that they are probabilistically equivalent, we call such designs quasi-experimental designs. If we measure them on a pretest, we can examine whether they appear to be similar on key measures before the study begins and make some judgement about the plausibility that a selection bias exists.
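The contrast between random assignment and intact groups, and the idea of checking pretest comparability, can be sketched in a few lines of Python. This is an illustrative sketch rather than a prescribed procedure. The pretest scores are simulated, the two intact groups are assumed to start 8 points apart, the group sizes are exaggerated to keep sampling noise small, and the standardized mean difference is just one of several possible ways to summarize pretest similarity.

```python
import random
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = ((variance(group_a) + variance(group_b)) / 2) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd


def randomly_assign(scores, rng):
    """Shuffle a copy of the sample and split it into two equal halves."""
    shuffled = list(scores)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(42)

# Randomized experiment: one pool of children, split by chance alone.
pool = [rng.gauss(50, 10) for _ in range(1000)]  # hypothetical pretest scores
program, comparison = randomly_assign(pool, rng)
smd_random = standardized_mean_difference(program, comparison)

# Quasi-experiment: two intact groups that happen to start at different
# levels (an assumed 8-point gap in average pretest score).
group_a = [rng.gauss(46, 10) for _ in range(500)]
group_b = [rng.gauss(54, 10) for _ in range(500)]
smd_intact = standardized_mean_difference(group_a, group_b)

print(f"pretest difference, random assignment: {smd_random:+.2f} SD")
print(f"pretest difference, intact groups:     {smd_intact:+.2f} SD")
```

With random assignment the pretest difference should be close to zero, which is what "probabilistically equivalent" means in practice. A large pretest difference between intact groups makes a selection bias more plausible, although a small one does not rule out differences on characteristics that were never measured.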
Even if we move to a multiple group design and have confidence that our groups are comparable, we cannot assume that we have strong internal validity. There are a number of social threats to internal validity that arise from the human interaction present in applied social research that we will also need to address.