Evidence Generation — Effect Size - John Jay College Research and Evaluation Center

Statistical significance is not the only metric used by researchers, especially in evaluation studies. Significance is often not even the best way to assess the importance of a finding, and significance levels are easily misconstrued. The level of significance associated with a particular finding is a function of several factors, chiefly the absolute size of a difference, the number of cases or observations (the N) used to establish that difference, and the extent to which individual measures vary around the overall average.

Even a seemingly large difference (e.g., 80 percent in Group A versus 50 percent in Group B) may fail to reach the level of statistical significance because the finding was generated with a very small sample or individual measures were widely dispersed. For example, the study may have collected data on just ten youth in each group and measures in those groups may have ranged from 10 to 90 percent.

In contrast, even a small difference (e.g., 50% versus 53% recidivism) could be statistically significant if the data set being analyzed was sufficiently large. Few public officials, however, would want to risk their resources or their reputations on a difference of three percentage points, unless they could describe the difference in some other way, perhaps economically or in terms of individual public safety (e.g., number of crimes averted).
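The two contrasting cases above can be checked with a rough calculation. The sketch below uses a pooled two-proportion z-test, which is only one of several possible tests and relies on a normal approximation that is crude for groups of ten. The group size of 10,000 in the second case and the function name `two_proportion_p_value` are assumptions made for illustration; the source says only that the data set is "sufficiently large."

```python
import math

def two_proportion_p_value(p1, p2, n1, n2):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided tail area of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

# Large difference, tiny sample: 80% vs 50% with ten youth per group
small_sample = two_proportion_p_value(0.80, 0.50, 10, 10)

# Small difference, large sample: 53% vs 50% with an assumed 10,000 per group
large_sample = two_proportion_p_value(0.53, 0.50, 10_000, 10_000)

print(f"80% vs 50%, n = 10 per group:     p = {small_sample:.3f}")
print(f"53% vs 50%, n = 10,000 per group: p = {large_sample:.6f}")
```

Under these assumptions, the 30-point difference does not reach the conventional 0.05 threshold, while the 3-point difference falls well below it.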

Researchers who focus on statistical significance alone may fail to appreciate the substantive importance of a finding. It is not uncommon to hear investigators at academic conferences draw profound conclusions or policy implications from relatively minor differences that are “significant” mainly because they were found using very large data sets. Outside of academic discussions, however, such minor differences are seen as relatively unimportant.

In the policy arena, “effect size” serves as an alternative to statistical significance. Effect size is often defined as the change in an outcome divided by its standard deviation, a measure of variation. Measures of effect size can be constructed in other ways as well, but all of the measures have a common function, which is to estimate the magnitude of a treatment effect given a specified level of intervention. A study might estimate the change in the prevalence of recent drug use among a sample of individuals following their participation in a new type of treatment, or researchers might measure the change in frequency of anti-social behavior in an entire community following the implementation of a new law. Effect size, rather than statistical significance, is the primary language used to describe the benefits of social interventions.
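As a minimal sketch of the "change divided by standard deviation" definition, the following computes a standardized difference in means using a pooled standard deviation (the form often called Cohen's d). Pooling is one common choice, not necessarily the one used in any particular study, and the outcome values are invented purely for illustration.

```python
import math
import statistics

def standardized_effect(treatment, comparison):
    """Difference in group means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(comparison)
    var_t = statistics.variance(treatment)    # sample variance (n - 1 denominator)
    var_c = statistics.variance(comparison)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (statistics.mean(treatment) - statistics.mean(comparison)) / pooled_sd

# Hypothetical counts of anti-social incidents per person (not real data)
treatment_group = [2, 3, 1, 4, 2, 3]
comparison_group = [3, 4, 2, 5, 3, 4]

d = standardized_effect(treatment_group, comparison_group)
print(f"Standardized effect size: {d:.2f}")
```

A negative value here means the treatment group shows less of the outcome than the comparison group, which is the desired direction when the outcome is something like recidivism.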

The “effect” of a public safety program could be the change observed in an indicator of client behavior (e.g., recidivism). One common way to gauge whether an effect is large or small is to compare the scale of change in recidivism to the mean or average recidivism rate among a population of interest. A program that reduces recidivism by 50 percent will generally have a larger effect size than a program that lowers recidivism by just 10 percent. However, the apparent size of such a change is sensitive to the mean level of the outcome. When mean levels are low, such as when only 10 percent of a sample is expected to be re-arrested in the first place, a change of just 3 percentage points (from 10% to 7%) would produce a relative change of 30 percent.
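The sensitivity to the baseline can be shown with simple arithmetic. In the sketch below, the 10-to-7 percent case comes from the text, while the 50-to-47 percent case is added here only for comparison; `relative_change` is a hypothetical helper name.

```python
def relative_change(baseline_rate, new_rate):
    """Reduction expressed as a share of the baseline rate."""
    return (baseline_rate - new_rate) / baseline_rate

# Same 3-percentage-point drop, measured against two different baselines
low_baseline = relative_change(0.10, 0.07)    # example from the text
high_baseline = relative_change(0.50, 0.47)   # comparison case, assumed for illustration

print(f"10% -> 7%:  relative change of {low_baseline:.0%}")
print(f"50% -> 47%: relative change of {high_baseline:.0%}")
```

The identical absolute change looks five times larger in relative terms when the baseline is 10 percent rather than 50 percent.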

Another way to gauge the size of an effect is to compare the scale of change with the expected variation of a key variable. For example, if recidivism for a particular group of youth was known to fluctuate between 5 and 90 percent, a change of 3 percentage points would seem trivial. On the other hand, if recidivism was always between 45 and 50 percent, a program able to produce a consistent decline of 3 percentage points could have a very strong effect size.
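A crude way to see this contrast is to divide the change by the width of the range over which the outcome normally fluctuates. This ratio is only a heuristic for illustration, not a standard effect size statistic (a formal measure would typically use the standard deviation), and the function name is hypothetical.

```python
def change_relative_to_range(change_points, low, high):
    """Change in percentage points as a share of the outcome's usual range."""
    return change_points / (high - low)

# Recidivism that fluctuates between 5 and 90 percent
wide_range = change_relative_to_range(3, 5, 90)

# Recidivism that always falls between 45 and 50 percent
narrow_range = change_relative_to_range(3, 45, 50)

print(f"Share of usual range (5-90%):  {wide_range:.3f}")
print(f"Share of usual range (45-50%): {narrow_range:.3f}")
```

The same 3-point decline covers a small fraction of the wide range but more than half of the narrow one.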

Judging the evidence of interventions according to effect size alone, however, could also be inappropriate in some cases. Programs with smaller effect sizes may still be valuable. Some programs merit the label “evidence-based” because they generate a positive return on investment. A program that costs very little can be a worthwhile investment even if it has a relatively small effect size. For example, the Perry Preschool Project was considered highly successful by researchers despite its relatively small effect size of –.10. For the most part, however, effect sizes need to fall in the range of –.15 to –.30 in order to have a real and lasting impact on policy and practice.
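The return-on-investment argument can be sketched as a simple benefit-cost ratio. Every figure below (participants, baseline rate, reductions, and dollar amounts) is hypothetical and chosen only to show how a cheap program with a small effect can outperform an expensive program with a larger one. The figures are not estimates for any real program, and the simplified formula ignores factors a real cost-benefit analysis would include.

```python
def benefit_cost_ratio(participants, baseline_rate, relative_reduction,
                       cost_per_offense, cost_per_participant):
    """Dollar value of averted re-offenses divided by total program cost."""
    offenses_averted = participants * baseline_rate * relative_reduction
    benefits = offenses_averted * cost_per_offense
    costs = participants * cost_per_participant
    return benefits / costs

# Hypothetical low-cost program with a small effect
cheap_program = benefit_cost_ratio(
    participants=1000, baseline_rate=0.40, relative_reduction=0.05,
    cost_per_offense=20_000, cost_per_participant=200)

# Hypothetical high-cost program with a larger effect
expensive_program = benefit_cost_ratio(
    participants=1000, baseline_rate=0.40, relative_reduction=0.20,
    cost_per_offense=20_000, cost_per_participant=5_000)

print(f"Low-cost, small effect:   {cheap_program:.2f}")
print(f"High-cost, larger effect: {expensive_program:.2f}")
```

A ratio above 1 means the estimated benefits exceed the costs. With these assumed inputs, only the low-cost program clears that bar.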

_______

Adapted from Butts, Jeffrey A. and John Roman (2011). Better Research for Better Policies, pp. 513-514, in Juvenile Justice: Advancing Research, Policy, and Practice. Sherman, Francine and Francine Jacobs (Editors). Hoboken, NJ: John Wiley & Sons.

____________________

References

Aos, Steven, Polly Phipps, Robert Barnoski, and Roxanne Lieb (2001). The Comparative Costs and Benefits of Programs to Reduce Crime (Version 4.0). Olympia, WA: Washington State Institute for Public Policy.

Gies, Stephen V., Lindsey M. Nichols, Frank Mojekwu, Rob T. Guerette and Emily E. Tanner-Smith (2023). Applying an empirically derived effect size distribution to benchmark the practical magnitude of interventions to reduce recidivism in the USA. Journal of Experimental Criminology, 20: 817-841.

Greenwood, Peter W. (2006). Changing Lives—Delinquency Prevention as Crime-Control Policy. Chicago, IL: University of Chicago Press.