Test question quandary: multiple-choice exams reduce higher-level thinking

Last fall, I read an article in CBE-Life Sciences Education by Kathrin F. Stanger-Hall, “Multiple-choice exams: an obstacle for higher-level thinking in introductory science classes” (CBE-Life Sciences Education, 2012, Vol 11(3), 294-306). I was interested and disturbed by the findings … though not entirely surprised by them. When I got the opportunity to choose a paper for the oCUBE Journal Club, this was the one that first came to mind, as I’ve wanted to talk to other educators about it. I’m looking forward to talking to oCUBErs, but I suspect that there are many other educators who would also be interested in this paper, and some of the questions/concerns that it prompts.

The study:

[Figure: graph showing lower “fairness in grading” ratings on student evaluations of teaching (SET) in the MC+SA group]
Figure 4 from Stanger-Hall (2012). “Student evaluations at the end of the semester. The average student evaluation scores from the MC + SA class are shown relative to the MC class (baseline).” Maybe reports of student evaluations of teaching should also include a breakdown of assessments used in each class?

Stanger-Hall conducted a study with two large sections of an introductory biology course, taught in the same term by the same instructor (herself), with differences in the types of questions used on tests for each section. One section was tested on midterms by multiple-choice (MC) questions only, while midterms in the other section included a mixture of both MC questions and constructed-response (CR) questions (e.g., short answer, essay, fill-in-the-blank), referred to as MC+SA in the article. She had a nice sample size: 282 students in the MC section, 231 in the MC+SA section. All students were introduced to Bloom’s Taxonomy of thinking skills, informed that 25-30% of exam questions would test higher-level thinking*, and provided guidance regarding study strategies and time. Although (self-reported) study time was similar across sections, students in the MC+SA section performed better on the portion of the final exam common to both groups, and reported use of more active study strategies vs. passive ones. Despite higher performance, the MC+SA students did not like the CR questions, and rated “fairness in grading” lower than those in the MC-only section. (I was particularly struck by Figure 4, illustrating this finding.)

Student resistance to learning is discussed in the paper, as student surveys throughout the course showed a drop in reported appreciation of “learning on all levels”, but more dramatically in the MC+SA section. (At the beginning of the course, students in the MC+SA section reported seeing higher value in learning on all levels than the MC-only group; I would have liked to have seen student responses to this question BEFORE students were aware of how they would be assessed. Of course, that may not have been practical.)  Stanger-Hall suggests that the feedback provided by CR questions during the course may have prompted students to re-evaluate their own competencies, and those whose perceptions of their own competency were lowered also reported decreased appreciation of learning on all levels. (Learning is brain work!)

There was also an indication of a possible gender bias – the performance of male students on MC questions of the final exam was higher than that of the female students, which was not observed with final exam CR questions in this study. (Interestingly, in the introduction, it is mentioned that some previous reports have indicated males are at an advantage with MC, while others have shown an advantage with CR questions for males – references are cited in the Stanger-Hall paper.)

My thoughts/questions/concerns:

Nearly all the courses I’ve taught over the past decade would be considered large (or very large) classes – 100 to 550 students in a section. (One summer, I taught a smaller group of just over 30 students in Microbiology – a third-year course at York University – which was a lot of fun, as I had so many more options for class activities and assessments, and I got to know many of my students a lot better than I normally have the opportunity to.) MC questions on tests/exams are a staple when I have >100 students, and (like many of my colleagues) I spend a lot of time trying to construct “good” MC questions, aligning the questions with learning objectives, and ensuring that questions test various levels of Bloom’s taxonomy in the cognitive domain (and the Structure of Observed Learning Outcome (SOLO) taxonomy – see Biggs & Collis, 1982). In these classes, where possible (i.e., when I have some assistance grading, or a lighter teaching load), I include (more) CR questions on tests – I like including different question types, and the answers often give some insight as to student thinking/understanding that may not be revealed from MC scores. The practical issues with grading CR questions (consistently and in a reasonable time-frame), along with managing/returning test papers, are not insignificant in large classes. While I think the opportunity to provide individual feedback is much greater with graded CR questions, I’m not sure how many students reflect on this feedback. And, in some courses, I’ve encountered resistance to inclusion of CR questions (even though students perform better, on average, on the CR questions compared to the MC ones). One of Stanger-Hall’s suggestions to overcome student resistance to learning is that all introductory science courses should use a mixed exam format.

I have wondered about the impact of the assessments used in other courses on student expectations, particularly as I have had some Nursing students complain about CR questions on tests (before and after taking the tests), explaining that their experience has primarily been with MC-only exams. (Some of these students also expressed comfort with having large numbers of MC questions on an exam – 80 or 90 questions on a midterm (50-80 minutes long). While I can only speculate, as I haven’t seen any examples of such tests, I have to wonder how many of those questions could be higher-level thinking questions …!) Is this also evidence of resistance to learning, and could increased resistance outweigh the benefits of including CR questions on tests?

As instructors teaching large classes, we face a number of challenges in assessment. Development of fair and meaningful assessments, consistent grading with constructive feedback returned quickly – while attempting to foster and evaluate learning – are labour-intensive, particularly in large classes. With limited time/resources, one may have to make some difficult choices of where to spend energy/TA hours (e.g., in balance of formative vs. summative assessments). I find it daunting that even if an instructor uses higher-level thinking MC questions, just the knowledge that the assessments are via MC questions may affect student approaches to learning (and quality of learning). (Would inclusion of just a single CR question on each test/exam that is otherwise MC promote higher-level thinking?)

Student discomfort with CR questions, and the perception of unfairness in grading associated with CR questions (reflected in student evaluations of teaching, as shown in this study) should probably be considered in terms of tenure/promotion, and I wonder if these student perceptions also have an impact on the dynamic in the classroom. As individual instructors, are there some ways we can mitigate student concerns about CR questions/grading? Stanger-Hall suggests that additional instruction in how to properly answer CR questions might help overcome student resistance – this might be worth investigating to see if it would make a significant difference in terms of student attitudes and performance.

This study also makes me wonder about the impact of the questions on standardized exams like the MCAT, GRE, and CRNE on student learning in preparation for them. In introductory biology classes, many of our students hope to go to medical school, and are aware that MCAT scores can be important for admission. It should be noted that an analysis of MCAT biology questions showed that the MCAT (and GRE) actually included more higher-order thinking MC questions (ranked via Bloom’s taxonomy) than the first-year medical school, undergraduate, and AP courses under evaluation; even including CR questions (in AP and some of the undergraduate exams), there was “no significant difference in average-weighted Bloom’s rating or the proportion of higher-order questions” comparing the biology questions on the MCAT (all MC) to the biology questions on other tests (Zheng et al., 2008). The CRNE – of interest to me, as I teach Nursing students – previously included a mix of both MC and CR questions, but is now MC-only.

The potential issue of gender bias is worrisome (if not conclusive). I found myself wondering if there have been studies that might reveal whether assessment types could also be biased towards/against other groups of students (e.g., visible minorities) – I have not yet done a comprehensive search. If bias exists, is there a way we can “level the playing field”? (I’m guessing probably not, at least until we have some idea of *why* such a bias exists … and if there are other groups that might be disadvantaged by certain types of assessments.)

It seems likely that student perceptions regarding MC and CR questions (i.e., that MC questions are “easier”, and memorization is adequate as a studying strategy; CR questions are more difficult, and involve higher-level thinking) are established before they reach college/university. Are K-12 educators struggling with this as well?

And, finally (and somewhat urgently!) … how am I going to assess the >340 students in introductory microbiology (about 100 of those being Nursing students) this coming fall? Somehow, I suspect that however I answer this, I won’t be completely happy/satisfied about it.


Biggs, J. B. and Collis, K. (1982). Evaluating the Quality of Learning: the SOLO taxonomy. New York, Academic Press.

Stanger-Hall, K. F. (2012). Multiple-choice exams: an obstacle for higher-level thinking in introductory science classes. CBE-Life Sciences Education, 11(3), 294-306.

Zheng, A. Y., Lawhorn, J. K., Lumley, T., and Freeman, S. (2008). Application of Bloom’s Taxonomy Debunks the “MCAT Myth”. Science, 319, 414-415.

* ADDENDUM (based on some Twitter discussions):

Folks interested in constructing MC questions addressing higher cognitive levels (and/or convincing other people that it’s possible to do this!) might want to check out:

Azer, S.A. (2003). Assessment in a problem-based learning course: Twelve tips for constructing multiple choice questions that test students’ cognitive skills. Biochemistry & Molecular Biology Education 31 (6): 428–434.

Clifton, S. L., & Schriner, C. L. (2010). Assessing the quality of multiple-choice test items. Nurse educator, 35(1), 12-16.

DiBattista, D. (2008). 21. Making the Most of Multiple-Choice Questions: Getting Beyond Remembering. Collected Essays on Learning and Teaching, 1.

Morrison, S., and Walsh Free, K. (2001). Writing multiple choice test items that promote and measure critical thinking. J Nurs Educ. 40(1): 17-24.

Palmer, E.J. and Devitt, P.G. (2007). Assessment of higher order cognitive skills in undergraduate education: Modified essay or multiple choice questions? BMC Medical Education 7: 49.

Simkin, M. G., & Kuechler, W. L. (2005). Multiple‐Choice Tests and Student Understanding: What Is the Connection?. Decision Sciences Journal of Innovative Education, 3(1), 73-98.

Also, Henry L. Roediger III has published a number of studies on multiple choice exams and the testing effect. (Thanks to Catherine Rawn for the reminder!)

2 thoughts on “Test question quandary: multiple-choice exams reduce higher-level thinking”

  1. You mentioned that “even though students perform better, on average, on the CR questions compared to the MC ones”. I’ve found in my evolution/ecology classes students often do worse on these because they can’t explain these concepts (they throw in jargon, but often it’s misused) or they don’t answer the question. I wonder if there are discipline-specific (or sub-discipline?) differences in performance on MC vs. CR questions? (I’ve thought about this after looking at short-answer questions for the cell biology/genetics half of our first year course, in which students tend to perform on par with their MC.) Or maybe my CR questions are just not well written and thus misinterpreted. (I’m definitely not discounting the latter.)

    1. Interesting points! My first thought was to wonder if there might be a difference in terms of student level/experience – most recently, I’ve been teaching students who are at least in second year, or higher, so there could be differences relating to experience/background compared to first year students. But perhaps the type of concept itself (within a sub-discipline) is key – it seems probable that there would be more misconceptions/confusion when students answer questions on the more abstract concepts (whether in evolution, cell biology, microbiology, whatever) compared to the ones that are more concrete. (Hmmm … Have my CR questions been on more concrete concepts?) I wonder if this might be the case for any “troublesome knowledge” questions …?
