November 2017
By David A. Kilpatrick
Two current buzzword terms in education are research based and evidence based. These terms suffer from the same basic problem—they are “unprotected.” This means that anyone can use them. The term research based attached to a reading program means nothing more than “please buy our program.” How then can we determine the actual effectiveness of reading instructional approaches or interventions? Let us consider four ways.
Raw-Score Gains
This involves tracking raw-score progress (e.g., words correct per minute [wcpm] in oral reading). Despite its popularity and intuitive appeal, this approach has limited value in determining intervention effectiveness because it tells us only whether the student is moving forward, not whether the student is catching up to his or her peers. Consider a second-grade student with dyslexia who improves from 12 wcpm in the fall to 36 wcpm in the spring. This tripling of the raw score may seem impressive. However, the average-achieving classmates of that student have moved from about 50 wcpm to 95 wcpm during the same period. The 38-wcpm gap between this student and his peers in the fall has therefore widened to a 59-wcpm gap in the spring. Thus, without peer-based comparisons, raw-score progress cannot tell us whether a student is narrowing or widening the gap (i.e., catching up or falling further behind).
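The arithmetic behind this example can be sketched in a few lines of Python. The function name `peer_gap` is a hypothetical helper introduced only for illustration; the wcpm figures are the ones given in the example.

```python
def peer_gap(student_wcpm: float, peer_wcpm: float) -> float:
    """Return how far the student trails the peer average, in wcpm."""
    return peer_wcpm - student_wcpm


# Figures from the second-grade example in the text.
fall_gap = peer_gap(student_wcpm=12, peer_wcpm=50)
spring_gap = peer_gap(student_wcpm=36, peer_wcpm=95)

raw_gain = 36 - 12                  # the student's own progress
gap_change = spring_gap - fall_gap  # positive means falling further behind

print(f"Raw-score gain: {raw_gain} wcpm")
print(f"Fall gap: {fall_gap} wcpm, spring gap: {spring_gap} wcpm")
print(f"Change in gap: {gap_change:+} wcpm")
```

The raw gain is positive, yet the change in the gap is also positive, which is the pattern that raw scores alone would hide.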
Statistical Significance
In research studies, statistical significance is designed to indicate whether the difference between experimental (intervention) group and control (comparison) group performance is likely to be genuine. However, it does not tell us the magnitude of the difference. Many studies display statistically significant results that only amount to gains of 3 or 4 standard score points. Such gains are so small that teachers or parents are not likely to detect any observable difference in the student’s real-world reading performance. Statistically significant is not the same as educationally meaningful. Thus, statistically significant does not necessarily mean effective.
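A rough sketch of why a small gain can still be statistically significant: with enough students, even a 3-point difference clears the conventional p < .05 threshold. The sample size of 200 per group is a hypothetical value chosen for illustration, and the simple two-group z-test with a known standard deviation of 15 is a simplifying assumption; published studies typically use t-tests or more elaborate models.

```python
import math


def two_group_z_test(mean_diff: float, sd: float, n_per_group: int):
    """Two-sided z-test for a difference in means, assuming equal group
    sizes and a known, shared standard deviation (a simplification)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical study: a 3-point standard score difference, 200 students per group.
z, p_value = two_group_z_test(mean_diff=3.0, sd=15.0, n_per_group=200)
difference_in_sd_units = 3.0 / 15.0

print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"Difference in standard deviation units: {difference_in_sd_units:.2f}")
```

Under these assumptions the result counts as statistically significant, although the difference is only one-fifth of a standard deviation.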
Effect Size
Effect size is the universal language of intervention research. All scientific journals require it. Effect size tells us the magnitude of improvement. An effect size of 1.0 equals one standard deviation within a population. Consider the standard scores used in tests of IQ, educational achievement, speech pathology, etc. A standard score of 100 is the mean or midpoint (i.e., 50th percentile), and a standard deviation is equal to 15 standard score points. Thus, an effect size of 1.0 is equivalent to 15 standard score points. If a student’s performance earns a standard score of 85 (16th percentile) on a reading test, that is one standard deviation below the mean of 100. If a student shows an effect-size gain of +1.0, that student has made the equivalent of a 15-point standard score gain and is now at 100 (50th percentile). With that 1.0 effect-size improvement, the student advanced from a below average reader to an average reader. Right? Not necessarily. Despite its pervasive use in research reports, effect size is seriously problematic as an index of intervention effectiveness. Consider the following examples:
- A 2012 reading intervention study in the Journal of Learning Disabilities¹ reported an effect size of .49. That is half a standard deviation—equivalent to about 7.5 standard score points. Yet the difference between the normative-based pretest and post-test standard scores was 0! How can that be? It turns out that the normative scores of the comparison, or control, group decreased substantially during the intervention period. The effect size is a comparison with a specific control group, not with the general population. Nationally, the students in the experimental (intervention) group did not narrow the gap with their peers at all—despite the respectable effect size.
- In a 2017 report in the Journal of Learning Disabilities,² a popular intervention demonstrated an effect size of .96. That represents nearly a full standard deviation. Yet scores for the experimental (intervention) group increased only by about one-half of a standard score point! That is equivalent to about 1/30th of a standard deviation—not even close to a whole standard deviation as the large .96 effect size implies. How could that happen? This was a study of a summer tutoring program. After 100 hours of 1:1 instruction, the students did not catch up in the slightest with their peers in terms of normative improvement. The large effect size occurred because the control group’s scores decreased dramatically over the summer.
- Distortions based on effect size not only make ineffective approaches look effective; they can do the opposite as well. In a 2010 report in Annals of Dyslexia,³ two experimental (intervention) groups averaged amazing gains of 22 standard score points. Low average and below-average readers became average readers—but the effect size was only .53. That is a statistical tie with the .49 effect size from the 2012 report discussed above. This modest effect size occurred because the control group received a highly effective school-based intervention independent of the researchers—and demonstrated a gain of 14 standard score points. Such a gain is unprecedented for a control group in the intervention literature. The net effect is that it made the 22-point gain of the experimental (intervention) groups look less impressive.
Using effect size as the metric for intervention effectiveness, we would have to conclude that normative gains of 0 standard score points and 22 standard score points comparing peers across the country represent the same level of effectiveness. This is clearly nonsensical. Studies like these show us that effect sizes represent comparisons of experimental (intervention) groups with specific control groups within specific studies—not national norms. Essentially, effect sizes show comparisons with moving targets. As a result, they do not allow us to know if an intervention is truly effective.
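The moving-target problem can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that an effect size is simply the difference between the intervention group's gain and the control group's gain, divided by the normative standard deviation of 15. That is a simplification: the studies themselves computed effect sizes from their own sample statistics, so the control-group changes printed here are implied values under this assumption, not figures reported by the authors. The function names are hypothetical.

```python
NORM_SD = 15.0  # standard deviation of the standard score scale


def effect_size_vs_control(treatment_gain: float, control_gain: float,
                           sd: float = NORM_SD) -> float:
    """Effect size as the gain difference in SD units (simplified)."""
    return (treatment_gain - control_gain) / sd


def implied_control_change(treatment_gain: float, effect_size: float,
                           sd: float = NORM_SD) -> float:
    """Control-group change that would produce the given effect size."""
    return treatment_gain - effect_size * sd


# 2010 example: both gains are stated in the text (22 and 14 points).
d_2010 = effect_size_vs_control(treatment_gain=22, control_gain=14)

# 2012 and 2017 examples: the text gives the effect size and the
# intervention group's normative gain; the control change is implied.
control_2012 = implied_control_change(treatment_gain=0.0, effect_size=0.49)
control_2017 = implied_control_change(treatment_gain=0.5, effect_size=0.96)

print(f"2010: gain of 22 points, effect size of about {d_2010:.2f}")
print(f"2012: gain of 0 points, implied control change {control_2012:+.2f}")
print(f"2017: gain of 0.5 points, implied control change {control_2017:+.2f}")
```

Under this simplified model, the 2010 figures reproduce an effect size of roughly .53, while the 2012 and 2017 effect sizes are driven almost entirely by declines in the control groups rather than by gains in the intervention groups.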
With all due respect to their laudable intentions, we must now consider reports coming out of What Works Clearinghouse, bestevidence.org, and similar outlets with a healthy skepticism.
Normative Standard Scores
The correlation between various word-identification subtests from commercially available achievement tests is very high. This indicates that all those tests are doing a good job of representing and measuring the actual word-identification skills of the general population of students across the country. (Unfortunately, this is not the case for different subtests measuring reading comprehension.) Normative-based word-identification subtests provide us with a stable point of reference in determining intervention effectiveness. They appear to be the only reliable metric for telling us if a student is catching up with his or her same-aged peers. Raw-score gains, statistical significance, and effect size cannot tell us that. While not useful for short-term progress monitoring, normative scores can be useful for annual checkups. Studies demonstrating the most highly effective intervention outcomes in word-level reading showed gains of 12 to 25 standard score points in less than half a school year.⁴
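To make standard score gains concrete, the sketch below converts standard scores to percentile ranks, assuming scores follow a normal distribution with a mean of 100 and a standard deviation of 15, as described earlier. The starting score of 85 is a hypothetical student, and `percentile_rank` is an illustrative helper, not part of any test publisher's scoring procedure.

```python
from statistics import NormalDist

# Standard score scale: mean 100, standard deviation 15 (normality assumed).
STANDARD_SCORES = NormalDist(mu=100, sigma=15)


def percentile_rank(standard_score: float) -> float:
    """Approximate percentile rank for a standard score."""
    return 100.0 * STANDARD_SCORES.cdf(standard_score)


# Hypothetical student starting one standard deviation below the mean.
start = 85
for gain in (0, 12, 25):
    score = start + gain
    print(f"Gain of {gain:>2} points: standard score {score}, "
          f"percentile rank of about {percentile_rank(score):.0f}")
```

A score of 85 corresponds to roughly the 16th percentile and a score of 100 to the 50th, so gains in the 12-to-25-point range move a below-average reader into the average range on national norms.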
Conclusion
Before we adopt a teaching approach or program, we must ask those selling or promoting it for data showing standard score gains on word-level reading tests. That seems to be the most valid method of determining program effectiveness.
References
¹Vaughn, S., Wexler, J., Leroux, A., Roberts, G., Denton, C., Barth, A., & Fletcher, J. (2012). Effects of intensive reading intervention for eighth-grade students with persistently inadequate response to intervention. Journal of Learning Disabilities, 45(6), 515–525.
²Christodoulou, J. A. (2017). Impact of intensive summer reading intervention for children with reading disabilities and difficulties in early elementary school. Journal of Learning Disabilities, 50(2), 115–127.
³Torgesen, J. K., Wagner, R. K., Rashotte, C. A., Herron, J., & Lindamood, P. (2010). Computer-assisted instruction to prevent early reading difficulties in students at risk for dyslexia: Outcomes from two instructional approaches. Annals of Dyslexia, 60, 40–56.
⁴Kilpatrick, D. A. (2015). Essentials of assessing, preventing, and overcoming reading difficulties. Hoboken, NJ: Wiley.
David A. Kilpatrick, PhD, is an associate professor of psychology for the State University of New York, College at Cortland. He is a New York State-certified school psychologist with 28 years of experience in schools. He has been teaching courses in learning disabilities and educational psychology since 1994. David is a reading researcher and the author of two books on reading: Essentials of Assessing, Preventing, and Overcoming Reading Difficulties and Equipped for Reading Success.
Copyright © 2017 International Dyslexia Association (IDA). Opinions expressed in The Examiner and/or via links do not necessarily reflect those of IDA.
We encourage sharing of Examiner articles. If portions are cited, please make appropriate reference. Articles may not be reprinted for the purpose of resale. Permission to republish this article is available from info@dyslexiaida.org.