Evidence Generation — Effect Size

Statistical significance is not the only metric used by researchers, especially in evaluation studies. Significance is often not even the best way to assess the importance of a finding. Significance levels can be misconstrued. The level of significance associated with a particular finding is a function of several factors, but it is mainly derived from the absolute size of a difference in combination with the number of cases or observations (the N) used to establish that difference. Even a seemingly large difference (e.g., 80 percent of the youth in Group A were re-arrested, but only 50 percent in Group B) may fail to reach the level of statistical significance because the finding was generated with a very small sample. For example, the study may have collected data on just ten youth in each group.

In contrast, even a small difference (e.g., 50% versus 53% recidivism) could be statistically significant if the data set being analyzed was sufficiently large. Few public officials, however, would want to risk their resources or their reputations on a difference of three percentage points, unless they could describe the difference in some other way, perhaps economically or in terms of individual public safety (e.g., number of crimes averted). Researchers who focus on statistical significance alone can fail to appreciate the substantive importance of a finding. It is not uncommon to hear investigators at academic conferences draw profound conclusions or policy implications from relatively minor differences that are “significant” mainly because they were found using very large data sets. Outside of academic discussions, however, such minor differences are seen as relatively unimportant.
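The dependence of significance on sample size can be illustrated with a short calculation. The sketch below applies a pooled two-proportion z-test to the two recidivism examples above. That test is one common approximation, chosen here for illustration and not named in the text. The function name `two_proportion_p_value` and the figure of 10,000 cases per group in the second comparison are assumptions; the text says only that the data set would need to be sufficiently large.

```python
import math


def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value from a pooled two-proportion z-test.

    This is a normal approximation; with very small samples an exact
    test would be preferable, so treat the result as a rough guide.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # For a standard normal variable, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Large difference, tiny sample: 80 vs. 50 percent re-arrested, 10 youth per group.
small_sample = two_proportion_p_value(0.80, 10, 0.50, 10)

# Small difference, large sample: 50 vs. 53 percent, with an assumed
# 10,000 cases per group (hypothetical; the text gives no N).
large_sample = two_proportion_p_value(0.50, 10_000, 0.53, 10_000)

print(f"80% vs. 50%, n=10 each:     p = {small_sample:.3f}")
print(f"50% vs. 53%, n=10,000 each: p = {large_sample:.6f}")
```

Under these assumptions, the 30-point difference does not reach the conventional .05 threshold, while the 3-point difference does.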

In the policy arena, “effect size” increasingly serves as an alternative to statistical significance. Effect size is often defined as the change in an outcome divided by its standard deviation, a traditional measure of statistical variation. Measures of effect size can be constructed in other ways as well, but all of the measures have a common function, which is to estimate the magnitude of a treatment effect given a specified level of intervention. A study might estimate the change in the prevalence of recent drug use among a sample of individuals following their participation in a new type of treatment, or researchers might measure change in the frequency of anti-social behavior in an entire community following the implementation of new juvenile curfew laws. Effect size, rather than statistical significance, is the primary language used to describe the benefits of social interventions.
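As a rough illustration of the definition above, the following sketch computes a standardized effect size as the difference between two group means divided by their pooled standard deviation. This is one common construction (often called Cohen's d) and is not necessarily the one used in the studies discussed here. The function name and all data values are hypothetical.

```python
import math
import statistics


def standardized_effect_size(treatment, comparison):
    """Difference in group means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(comparison)
    var_t = statistics.variance(treatment)   # sample variance (n - 1 denominator)
    var_c = statistics.variance(comparison)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (statistics.mean(treatment) - statistics.mean(comparison)) / pooled_sd


# Hypothetical counts of anti-social incidents per youth during follow-up.
treatment_group = [1, 4, 2, 6, 2]
comparison_group = [2, 5, 3, 6, 4]

effect = standardized_effect_size(treatment_group, comparison_group)
# A negative value means fewer incidents in the treatment group.
print(f"Effect size: {effect:.2f}")
```

Dividing by the standard deviation puts outcomes measured on different scales into comparable units, which is what allows effect sizes to be compared across programs.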

The “effect” of a delinquency program could be the change observed in an important indicator of client behavior (e.g., recidivism). One common way to gauge whether an effect is large or small is to compare the scale of change in recidivism to the average recidivism rate among a population of interest. A program that reduces recidivism by 50 percent will generally have a larger effect size than a program that lowers recidivism by just 10 percent. However, percent change is very sensitive to the mean level of the outcome. When mean levels are small, such as when only 10 percent of a sample is expected to be re-arrested in the first place, a change of just 3 percentage points (from 10% to 7%) would produce a relative change of 30 percent. Another way to gauge the size of an effect is to compare the scale of change with the natural variation of the key variable. For example, if recidivism for a particular group of youth was known to fluctuate between 5 and 90 percent, a change of 3 percentage points would seem trivial. On the other hand, if recidivism rarely varied outside a 5-percentage-point range, for example from 45 to 50 percent, a program able to produce a consistent decline of 3 percentage points would likely have a very large effect size.
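The two yardsticks described above can be compared with a few lines of arithmetic. The sketch below computes relative change against a baseline rate and, as a crude stand-in for natural variation, the change as a share of the observed range. Using the range this way is an assumption made for illustration; a standard deviation would be the more usual denominator. The function names are hypothetical.

```python
def relative_change(baseline_pct, new_pct):
    """Change as a fraction of the baseline rate."""
    return (baseline_pct - new_pct) / baseline_pct


def share_of_range(change_pts, low_pct, high_pct):
    """Change as a fraction of the range over which the outcome normally varies."""
    return change_pts / (high_pct - low_pct)


# A 3-point drop from a 10 percent baseline, expressed as relative change.
print(f"Relative change, 10% -> 7%: {relative_change(10.0, 7.0):.0%}")

# The same 3 points set against wide versus narrow natural variation.
print(f"Share of a 5-90% range:  {share_of_range(3.0, 5.0, 90.0):.1%}")
print(f"Share of a 45-50% range: {share_of_range(3.0, 45.0, 50.0):.0%}")
```

The same 3-point change is a small fraction of the wide range and more than half of the narrow one, which is why the denominator matters as much as the change itself.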

In practice, effect sizes for delinquency prevention programs usually fall between –.30 (i.e., decreases the likelihood of delinquency) and +.30 (i.e., increases the likelihood of delinquency). The most successful interventions usually have effect sizes between –.10 and –.30. In their summary of program evaluations, Aos and his colleagues (2001) reported that Multi-Dimensional Treatment Foster Care (MTFC) reduced crime on average by 22 percent, which translated to an effect size of –.37 given the other figures involved. Another well-known program, Multisystemic Therapy, had an effect size of –.31. Other programs with strong results have included Nurse Home Visitation programs (–.29), Functional Family Therapy (–.25), and the Seattle Social Development approach (–.13) (Greenwood, 2006: 150).

Judging the evidence of delinquency interventions according to effect size alone, however, could also be inappropriate in some cases. Programs with smaller effect sizes may still be valuable. Some programs merit the label “evidence-based” because they generate a positive return on investment. A program that costs very little can be a worthwhile investment even if it has a relatively small effect size. For example, the Perry Preschool Project is considered highly successful by researchers despite its smaller effect size of –.10 (Greenwood, 2006). Programs like the Perry Preschool Project, of course, are the exception. For the most part, effect sizes for delinquency interventions need to fall in the range of –.15 to –.30 in order to have a real and lasting impact on policy and practice.


Excerpted from Butts, Jeffrey A. and John Roman (2011). Better Research for Better Policies, pp. 513-514, in Juvenile Justice: Advancing Research, Policy, and Practice. Sherman, Francine and Francine Jacobs (Editors). Hoboken, NJ: John Wiley & Sons.



Aos, Steven, Polly Phipps, Robert Barnoski, and Roxanne Lieb (2001). The Comparative Costs and Benefits of Programs to Reduce Crime (Version 4.0). Olympia, WA: Washington State Institute for Public Policy.

Greenwood, Peter W. (2006). Changing Lives: Delinquency Prevention as Crime-Control Policy. Chicago, IL: University of Chicago Press.