A recent article in the Chronicle reported issues on the minds of college administrators and faculty as they contemplate the future of higher education. One arguably important issue - whether college instruction and evaluation should conform to modern learning and evaluation sciences - was not mentioned. This omission is odd in light of growing concerns about costs, outcomes, institutional effectiveness, impact, time-to-degree, and other matters that rest on effectiveness and efficiency.
Pre-scientific Teaching Methods
Higher education employs a 1906 teaching model that is pre-scientific with respect to the relevant learning sciences. We are familiar with the 1906 model not only because we were taught by professors who employed it, but also because we employ it, or a close variant, in our own teaching.
The model is, “Read this chapter, answer these chapter questions, listen to me lecture, and take this test.” The model has been modified slightly over the years to include small changes as higher education became less the exclusive province of the rich and the smart and more the public path to careers. The most common modification added, “chat a little about this as a class” somewhere in the middle of the sequence.
Teaching out of a 1906 playbook was defensible in 1906. At that time, early learning science research was underway in a few psychology labs but the findings were inconsistent and had limited practical relevance for the university classroom. Teaching out of a 1906 playbook is not defensible in 2016 following decades of generalizations from cognitive, affective, and brain research pertinent to improving learning, retaining, generalizing, and applying what is taught in the classroom.
The case against 1906 can almost be made by way of analogy. How would we respond to a physician who deliberately ignored the last 100 years of progress in the medical sciences? Would anyone pay this person to engage in his trade?
The 1906 playbook has several defining attributes that are worth examining. Most prominent is that teaching and learning describe a vertical relationship. The professor is the source of knowledge that is disseminated downward to the student. Horizontal learning (student-to-student and learning team to learning team) is not provided for in the belief system or the classroom structure and is not evaluated when it occurs. Teaching and learning tend to center on low level cognitive content. By low level, I do not necessarily mean simple or easy to master. I mean that the content centers on knowing something, however difficult or complex, rather than demonstrating a proficiency to do something that subsumes that knowing.
Teaching in 1906 also emphasized the development of abstract cognitive mastery as demonstrated by recall over the kind of demonstrated competence and enhanced learning that is associated with teaching to others what you are attempting to master.
A Responsive Audience
Empirically, 1906 teaching works reasonably well for students who are intelligent in the field of study and whose strengths tend toward abstract thinking. That is, it works reasonably well for almost everyone reading this Executive Briefing. It is therefore not surprising that we teach to people like ourselves.
Multiple choice tests, essays, and term papers – most of them executed in a high stakes environment – are the stock in trade of the 1906 model and today's professoriate. All of these common tools are riddled with multiple kinds of invalidity. While it might be useful to develop a more technical Briefing on this topic (email me if you would like to see one), the common threats to validity of these common methods of assessing learning can be summarized in a non-technical fashion.
Essays and term papers receive different grades if re-evaluated after a lapse of time, or when identical content is presented in a neatly organized, laser-printed document versus a less neatly organized handwritten paper. Blind evaluations (students' names removed) receive different grades and comments than the same papers unblinded. Gender identity and surnames can affect grading and comments in many ways. In general, the most common weakness in essay tests is low reliability (e.g., score/re-score). With training and diligence, these structural sources of invalidity can be reduced but not eliminated. Blind scoring and scoring rubrics can increase reliability coefficients, but only up to a point and only if they are competently developed and applied. To obtain decent reliability statistics, it is often necessary to use two independent scorers trained to proficiency on carefully constructed dummy essays.
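The score/re-score and scorer-versus-scorer agreement discussed above is typically quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below is illustrative only; the function, the grades, and the two hypothetical scorers are my own assumptions, not data from this Briefing:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two scorers grading the same essays.

    Returns 1.0 for perfect agreement, 0.0 for agreement no better than
    chance, and negative values for agreement worse than chance.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of essays on which the two scorers agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each scorer's marginal grade rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(counts_a) | set(counts_b))
    return (observed - expected) / (1 - expected)

# Hypothetical grades assigned by two independent scorers to ten essays.
scorer_1 = ["A", "B", "B", "C", "A", "B", "C", "A", "B", "C"]
scorer_2 = ["A", "B", "C", "C", "A", "B", "B", "A", "B", "C"]
print(round(cohens_kappa(scorer_1, scorer_2), 2))  # prints 0.7
```

Raw percent agreement here is 80%, but kappa is lower (about 0.70) because two scorers handing out similar grade distributions will agree fairly often by chance alone.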
Multiple-choice tests contain a large number of questions that do not discriminate sharply between knowledgeable and uninformed students. Some of these items, commonly more than 20%, discriminate negatively, meaning that low-achieving students tend to answer the question correctly while high-achieving students do not. Most multiple-choice tests also contain about 20-30% of items that do not discriminate at all between high- and low-achieving students. As with essay tests, training and diligence can improve the validity of multiple-choice tests. Multiple-choice tests can be subjected to various forms of item analysis and refined to meet technical standards of validity.
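The positive and negative discrimination described above is commonly measured with an upper-lower discrimination index: the proportion of top scorers answering an item correctly minus the proportion of bottom scorers doing so. This is a minimal sketch under my own assumptions (the function name, the sample data, and the conventional 27% upper/lower split are illustrative, not from the article):

```python
def discrimination_index(item_correct, total_scores, frac=0.27):
    """Upper-lower discrimination index (D) for one test item.

    item_correct: 1/0 per student for this item; total_scores: each
    student's overall test score. D near +1 means the item separates
    high from low achievers; D <= 0 flags a non- or negatively
    discriminating item of the kind described above.
    """
    # Rank students by overall achievement, highest first.
    ranked = sorted(zip(total_scores, item_correct),
                    key=lambda t: t[0], reverse=True)
    k = max(1, round(len(ranked) * frac))  # size of upper/lower groups
    p_upper = sum(c for _, c in ranked[:k]) / k
    p_lower = sum(c for _, c in ranked[-k:]) / k
    return p_upper - p_lower

# Hypothetical results for ten students on one item.
scores = [95, 90, 88, 85, 70, 65, 60, 50, 45, 40]
correct = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
print(round(discrimination_index(correct, scores), 2))  # prints 0.67
```

In this toy example the item discriminates well (D = 0.67); an item answered correctly mainly by the bottom group would yield a negative D and be a candidate for revision or removal during item analysis.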
The lack of validity in what 1906 tools measure is less subject to improvement.
The technical failures of low discrimination, negative discrimination, and low reliability are all low level failures of instruments that fail in a more significant way. Both essay and multiple-choice tests (although the problem is greater for multiple-choice) tend to measure the low end of taxonomies of learning; e.g., being able to recall facts and relations. While assessing at the low end of a learning taxonomy is seldom a good idea, this kind of assessment is understandable in courses for which the rationale for taking the course is to be able to recite facts and relations among facts.
The greater problem in relying on these tools to determine what students learned is that most courses aim for a higher purpose. Many courses seek to create in their students the ability to analyze, synthesize, and evaluate alternative courses of action with respect to a problem nexus. Essays and multiple-choice tools can poke around the edges of these dimensions of learning and, with exceptional effort, can get at them in limited ways, although few actually do; the design and validation efforts required are often beyond the capabilities of the person creating the test.
These old assessment tools fail at an even deeper level.
Increasingly, courses seek to develop proficiency in application, which is defined as specific behavioral competencies. The best way to assess at this deep level is through the use of hands-on (performative) tasks that are evaluated in multiple dimensions, in multiple ways, from multiple perspectives or sources. These forms of assessment work best if they are authentic, meaning that the structure and content of the assessment is faithful to or congruent with the way competence is evaluated in the target or end-state application, whether the workplace, family, church, or community. Drilling down deeply into authentic learning and assessment is out of scope for this Briefing except to note that I believe it will become the emerging form of teaching and evaluating.
Does It Matter?
- It matters that our professors teach the same way their professors taught them and, in turn, were taught by their professors. Fifty years of learning sciences have taught us more than a little about how to improve the efficiency and impact of teaching activities.
- It matters when a learning scientist teaches her students about research-based ways of teaching and learning that result in reduced time on task and improved retention and application, and yet teaches these scientific findings with pre-scientific methods that are less effective and efficient. This professor has missed an important opportunity to conduct field research.
- It matters when a measurement scientist determines grades based on a single high stakes multiple-choice final examination for which 20% of the test items discriminate negatively and 30% of the test items fail to discriminate above chance levels. This professor has ceded his right to make important distinctions among levels of achievement, arguably a hallmark of good teaching.
- It matters that the standard 1906 assessment tools measure little more than rote memory, and not all that well. Increasingly, the demands of the workplace go to day-of-hire behavioral competence.
- It matters that the correlation between 1906 assessment scores and subsequent workplace competence is not much better than chance after excluding the outliers at both ends of the distribution. Employers in many sectors have lost confidence in the predictive validity of grades and degrees.
- It matters when a teacher cannot beat chance levels in assigning the same grade or making the same comments on the same essay examination he graded six months earlier or graded the week before in a different presentation format. Students are sensitive to these manifestations of inequity. The impact can be demoralizing.
On the other hand, haven't we been managing college classrooms 1906-style since horseless carriages were all the rage? Learning takes place, and the system works even if it is inefficient. In addition, a growing number of one-off professors and a few programs, most notably in the medical and health sciences, have aligned themselves with modern learning and measurement sciences. We can be sure that others will follow.
Reasons to Hurry this Along
The answer to the "Why the rush?" question lies not in impatience but in a clear view of the urgent needs confronting higher education. Let me suggest two good reasons to rush.
- Migrating a degree program from 1906 to a well-designed 2016 program would produce a four-year degree in less than three years, with proportionally less cost and superior outcomes. That extra time is available to be traded off between duration and intensity.
- The retention rates of programs based on learning and evaluation sciences are higher than those of 1906 programs. This is because the learning sciences have taught us something about the impact of engagement, horizontal learning, and authenticity. Remember, people like you tend to arrive with your own engagement in place. You do not represent the typical learner of 2016.
These two reasons among many may be reason enough to push this glacier along.