Category Archives: assessment

Below is a link to six speaking samples of two college students. What score would you give them?

6 samples rar

Please listen to each sample carefully and score each individually. They may have been made at different times.

For business topics, the student spoke from a prompt card that gave a topic and had 3 minutes to prepare. The student had no access to any materials, only time to think and plan. For personal topics, the answers to the questions were spontaneous.

Give one score per sample. If you want to use the IELTS scale, you can find the band descriptors here: IELTS Speaking Band Descriptors. Please say what scale you are using.

Generally, it is expected that students can speak about familiar topics, like family or friends, better than unfamiliar topics like business. So the difficulty of the task has to be considered in scoring.

Now you can try it: Can you rate speaking?

Please do not discuss your scores on the List until all of the scores have been published after the one month waiting period. If you have any questions or problems, please contact Dave Kees at: davekees[at]

Write your scores in the “Leave a comment” section on the left side of this page or click here: Comments. All score submissions will be withheld for one month and then published. This way submissions will not be influenced by previous submissions.

Your score:

What scale?:

Sample 1:
Sample 2:
Sample 3:
Sample 4:
Sample 5:
Sample 6:

(Special prize for the submissions that are closest to the average scores. The prize is Uncle Dave’s Tie Score Tie. Yes, now you can be the envy of your school and own one of these specially designed high-quality silk ties perfect for teachers who do oral English testing. While the student is talking, the teacher can adjust the gold tie clip up or down to indicate to the student how he is doing and as a reminder to the teacher of how the student performed! Note: This offer is void in countries outside of China and in all areas inside of China.)

James Hunter, Assistant Professor – English Language Center, Gonzaga University, Spokane USA

Excel has an Analysis ToolPak which can do a lot of statistical tasks. Help on installing it is here. Also, try the R Project. This is a free “software environment for statistical computing and graphics” and it will run on Windows, Mac, and Linux. I haven’t had much of a chance to play with it, but it is certainly not user-friendly. However, you can also get Statistical Lab, which is a GUI for R, also free but not for Mac or Linux. There’s also a free alternative to SPSS (the “big” stats package that businesses & colleges use), called PSPP.

With all of these, you can easily do correlation matrices, t-tests, chi-square tests, item analysis, ANOVA, etc. These will enable you to compare results on assessments, do pre- and post-tests, get inter-rater reliability information, find links between variables, etc. See also this for information on which statistical procedures to use when.

I use mean and SD on most tests and quizzes to a) compare classes to previous semesters and b) look at the distribution and spread of scores on a test/item. This helps to make informed decisions about assessment instruments, especially those that might be adopted as standardized tests for the program. I’ve done a lot of work with our placement instruments, for example, to determine reliability and check our cut scores.
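
As a rough illustration of the mean-and-SD comparison described above, here is a minimal Python sketch. The scores and the `summarize` helper are invented for the example; they are not data or tooling from the program itself.

```python
from statistics import mean, stdev

# Hypothetical quiz scores (out of 100) for two semesters of the same course.
previous_semester = [62, 70, 74, 58, 81, 66, 77, 69]
current_semester = [71, 75, 68, 80, 73, 77, 70, 74]

def summarize(scores):
    """Return the mean and the sample standard deviation of a list of scores."""
    return mean(scores), stdev(scores)

prev_mean, prev_sd = summarize(previous_semester)
curr_mean, curr_sd = summarize(current_semester)

# A higher mean suggests a stronger class (or an easier quiz);
# a smaller SD means the scores are bunched more tightly together.
print(f"Previous semester: mean={prev_mean:.1f}, SD={prev_sd:.1f}")
print(f"Current semester:  mean={curr_mean:.1f}, SD={curr_sd:.1f}")
```

With real data, a shift in the mean alone does not say whether the class or the test changed, which is why the spread is worth looking at alongside it.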

Recently, I’ve been doing research on corrective feedback in oral production, so have needed measures of accuracy and fluency (and complexity!). Statistical analysis has been essential to find correlations between, say, accuracy and reaction time on a grammaticality test and accuracy and production time in a correction test. For instance, in class a student says to another: *“Yeah, actually I’m agree with you”. This goes down on a worksheet for her (and occasionally other classmates – see this for a description of this methodology), and she is later given a timed test in which she sees the incorrect sentence and has to record a corrected version. Her speed in doing this task (plus her accuracy) gives a measure of whether this structure/lexis is part of her competence (or to use Krashen’s model, whether it has been “acquired” or “learned”: presumably, if this theory holds water, “learned” forms will take longer to process and produce than “acquired” ones). In addition to this production test, I’ve been doing a reaction-time test in which the same learner hears her own recording and has to decide, as quickly as possible, whether what she said is correct or not. You can try this for yourself here (you will not be able to hear student recordings, only a few practice sets, recorded by me using student errors from our database; use anything as Username and “elc” as password).

These measures yield thousands of results, and that’s why statistical analysis has been essential. Excel can do a lot of the work, especially in graphical representation, but SPSS has done most of the heavy lifting. For instance, it has revealed that there is no significant difference in reaction time (or accuracy) between a student listening to herself correcting an error she originally made and listening to herself correcting errors made by classmates. In other words, students are just as good or bad at noticing and judging errors whether they made them or a classmate did. The same is true in the correction task described above. This indicates that WHOSE error a student is correcting/judging has much less effect on her speed or accuracy than some other factor, e.g. the nature of the error itself. Probably a large “Duh!” factor there, but these things need to be ruled out before moving on…

By Peter Preston, Poland

Teachers do calculate the average score from tests, but then nothing serious is done with it. Even when the average score is close to the pass mark little statistical comment is made about the glaring problem that this represents. For example, if the average and the pass mark are the same and the population is normally distributed around the average, this means that 50% of the students fail. Can it be considered acceptable for 50% of the candidates to fail an end-of-the-year examination or even worse an end-of-the-course examination?
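
The 50% figure can be checked with a short Python sketch. It assumes, as the paragraph above does, that scores are normally distributed; the pass mark of 60, the SD of 10, and the helper name `expected_fail_rate` are made up for illustration.

```python
from statistics import NormalDist

def expected_fail_rate(mean_score, sd, pass_mark):
    """Share of candidates expected to score below pass_mark, assuming normality."""
    return NormalDist(mu=mean_score, sigma=sd).cdf(pass_mark)

# Hypothetical exam with a pass mark of 60 and an SD of 10 points.
same_as_pass_mark = expected_fail_rate(60, 10, 60)   # average equals the pass mark
one_sd_above = expected_fail_rate(70, 10, 60)        # average is one SD above it

print(f"Average = pass mark:       {same_as_pass_mark:.0%} expected to fail")
print(f"Average one SD above mark: {one_sd_above:.0%} expected to fail")
```

Real exam scores are rarely exactly normal, so this is only a first estimate of how many candidates a given average implies will fail.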

In fact at our college the last third-year UoE exam failed 80% of the students. Now you would think that a statistically-minded person would immediately start asking questions about validity of the exam. Construct validity – did the items set test the points intended to be tested? Course validity – did the items tested figure in the course syllabus? Is there a proper tie-up between the course syllabus and the test specifications (if the latter exist at all)? Did the distribution of correct responses discriminate between the weak and strong candidates? Were the items either too easy [not in this case] or too difficult? Is there any objective reference to competence standards built into the teaching programme? To ask just a few relevant questions.
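
Two of these questions, whether items were too easy or too difficult and whether they discriminate between weak and strong candidates, can be approached with a basic item analysis. The sketch below is one common way to do it, not a description of any procedure used at the college: it computes a facility value (proportion answering correctly) and a simple discrimination index (top third minus bottom third) for each item. The response data are invented.

```python
def item_analysis(responses):
    """responses: one list of 0/1 item scores per candidate.
    Returns a (facility, discrimination) pair for each item."""
    # Rank candidates by total score, strongest first.
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, len(ranked) // 3)
    top, bottom = ranked[:group_size], ranked[-group_size:]

    results = []
    for i in range(len(responses[0])):
        facility = sum(r[i] for r in responses) / len(responses)
        top_rate = sum(r[i] for r in top) / group_size
        bottom_rate = sum(r[i] for r in bottom) / group_size
        results.append((facility, top_rate - bottom_rate))
    return results

# Hypothetical answers from six candidates on three items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
]

results = item_analysis(responses)
for number, (facility, discrimination) in enumerate(results, start=1):
    print(f"Item {number}: facility={facility:.2f}, discrimination={discrimination:.2f}")
```

In this made-up data the first item is answered correctly by everyone, so it cannot separate candidates at all, while the second splits the strongest and weakest groups cleanly.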

I would love to hear that other institutions do use statistical analysis of exam data and look at the variance between different exam sittings using the same exam or different ones, but I wonder if small institutes can ever bring together the required expertise to carry out such work either before the exam goes live or afterwards. It would be great to conduct a poll on this matter to try to assess the use of statistics in the analysis of exam data at as many institutes as possible.

Peter Preston's students in Poland

My own experience inclines me to believe that exams are in fact not so much an educational evaluation of the work being done as a policy instrument to give face validity to the programme. As such one does not need to worry about the quality of the exam since one can adjust the results before publication. Or in the case of my institute the exam can be repeated by order from above until the teachers get the message.

I do not like the cynical manipulation of exam data, so having good quality statistical information and quality control of all documents involved in the course would be the start to a reevaluation of the course and teaching methods. By accurate assessment at the beginning of a course it should be possible to predict the level students could get to after a given number of teaching hours, taking into account the realities of life. By keeping proper statistical records over a few years one would accumulate powerful information. This is what insurance companies do to calculate their premiums.

By Erlyn Baack, now retired, formerly at ITESM, Campus Queretaro, Mexico

Both the IELTS and the TOEFL are proficiency tests that measure overall proficiency. They are both global in nature. I do not think they should be seen as achievement tests to be used at the end of a semester of study. Instead, they may be used to inform the achievement rubrics that should be developed within successive levels within an English program. Likewise, these proficiency exams should not be used as placement exams either because there are better placement exams available. There is not a single question on the TOEFL, for example, that discriminates between English ONE, TWO, and THREE levels. So for placement, even Michigan’s very old English Placement Test (if it is still available) would be better than the TOEFL.

That said, the IELTS and the TOEFL should inform the achievement (and the rubrics in each of the four skills, ideally) that teachers and/or course administrators want to achieve at each level within an English program. Teachers and/or course administrators have to decide the curriculum at each level: For example, in developing the curriculum for English ONE, teachers and/or course administrators must ask and answer the following questions: At the end of the semester, (1) What do we want the students to know (or achieve, or be, or be able to do)?, (2) How are we going to teach it?, and (3) How are we going to test it?

Teachers and/or administrators are then responsible for designing a curriculum and an ACHIEVEMENT exam, _with rubric_, that measures the level of student achievement throughout the semester. By definition, all students should be able to STUDY or PRACTICE the curriculum within the semester, which would lead to higher achievement scores, meaning there would be a high correlation between (1) the number of hours a student studies and (2) his/her final semester score. Those achievement scores, then, would affect the TOEFL and the IELTS only indirectly.
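
The claimed relationship between study hours and final scores is something a program could check with a Pearson correlation. Below is a minimal Python sketch; the `pearson_r` helper and the hours and scores are invented for illustration and are not data from any real course.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical data: hours studied during the semester and final exam score.
study_hours = [5, 10, 15, 20, 25, 30]
final_scores = [52, 60, 63, 71, 78, 85]

r = pearson_r(study_hours, final_scores)
print(f"Correlation between study hours and final score: r = {r:.2f}")
```

A value near 1 would be consistent with an achievement exam that rewards study; a value near 0 would suggest the exam is measuring something closer to general proficiency.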

I think it is helpful to distinguish between various exams and what they measure.

(1) Placement exams contain questions at all levels to place students within an English program. Michigan’s EPT is an example.

(2) Proficiency exams measure overall proficiency. The IELTS and TOEFL are examples, and they are used by universities, generally, to determine whether proficiency is sufficient for university studies.

(3) Achievement exams measure the level of student achievement within a semester of study. A major monthly exam, a mid-semester exam, or a final exam are examples of those. Did the student “achieve” what was supposed to have been taught and learned within a given week or month or semester?

By Maria Spelleri – Manatee Community College, Florida, USA

One way to get a sense of structure with the evaluation of student oral production is by using a rubric. Here’s an example of a speaking rubric for an ESL program in a US elementary school system: RUBRIC and here’s a site with programs to help you develop a rubric: DEVELOP RUBRIC

To create a rubric for a speaking activity such as retelling a story, you need to break the activity down into its most basic elements. For example, speech comprises vocabulary, grammar, pronunciation/stress/intonation, logical meaning and order, purpose, and in the case of the story, an element of cohesion. For the specific task, you might want to also consider the accuracy of the retelling, the amount of detail included, number or length of pauses and inappropriate filler noises, etc. Then, for each category, set the possible performance/assessment levels, for example, “excellent”, “satisfactory”, and “needs improvement”. I prefer to work with a basic set of 3 as it is easier for me to break down a production into bad, so-so, and good instead of more subtle variations, although plenty of instructors use 4 or 5 levels.
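
To make the idea concrete, here is a minimal Python sketch of a three-level rubric for a story retelling. The category list is a selection from the elements mentioned above, and the point values (1 to 3) and the `score_rubric` helper are assumptions for the example, not part of any published rubric.

```python
# Three performance levels, as suggested above, mapped to assumed point values.
LEVELS = {"needs improvement": 1, "satisfactory": 2, "excellent": 3}

# A selection of the elements listed above for a story-retelling task.
CATEGORIES = [
    "vocabulary",
    "grammar",
    "pronunciation",
    "logical order",
    "cohesion",
    "accuracy of retelling",
]

def score_rubric(ratings):
    """ratings: dict mapping each category to a level name.
    Returns (points earned, maximum possible points)."""
    missing = [c for c in CATEGORIES if c not in ratings]
    if missing:
        raise ValueError(f"Unrated categories: {missing}")
    total = sum(LEVELS[ratings[c]] for c in CATEGORIES)
    return total, len(CATEGORIES) * max(LEVELS.values())

# Hypothetical ratings for one student's retelling.
example = {
    "vocabulary": "excellent",
    "grammar": "satisfactory",
    "pronunciation": "satisfactory",
    "logical order": "excellent",
    "cohesion": "needs improvement",
    "accuracy of retelling": "excellent",
}

earned, possible = score_rubric(example)
print(f"Score: {earned} out of {possible}")
```

Moving to 4 or 5 levels only means extending `LEVELS`; the categories and the scoring logic stay the same.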

If you do a Google search using key words like “ESL Speaking rubric”, you should find many ideas to help you create a rubric that will meet your needs.

By the way, I would suggest recording the assessment either audio or audio/video because it can be hard to listen to content, mentally evaluate, and complete a rubric at the same time. Replaying the audio gives you time to better analyze the students’ work and assess more fairly. Playing back the recording for the student who can then watch him or herself and compare the recording to the completed rubric assessment is a valuable learning tool as well.

By Nik Bramblett – UCF, Orlando FL, USA

Sometimes we need to evaluate L2 socialization skills using an alternative assessment and not a paper test.

Here’s what I would do:

(a) Work with students (using an appropriate combination of whole group, breakout small-group, and/or individual/paired strategies) to develop a rubric for a role-playing activity. Discuss what “socialization skills” means and how you might measure mastery of them. Let the students decide what’s important and what they will be graded on (with appropriate guidance from you as necessary, of course).

(b) Have students work in pairs or trios for the assessment… students would randomly select a social problem-solving situation from a collection that you created on cards or whatever… “You need to make an important call [make up a specific scenario] and your cell phone is dead; there are two strangers nearby [perhaps it’s a bus stop or whatever]. Interact with those people to solve your problem.” for example. Students would have a brief period to plan/rehearse, and would then more-or-less improv a scene.

(c) Both you and the student audience would use the rubric you designed together (and reviewed clearly and modeled and practiced before these presentations began) to measure the ability of the students to perform whatever specific tasks, roles, etc. you had decided were the measurable objectives. Students’ ability to effectively judge their peers’ performance would (rightly) be part of the grade. This would not only measure the mastery of the skills but also the metacognition behind the skills.

By Noriko Ishihara – University of Minnesota, USA / Hosei University

[An excellent way to test students’ language abilities is in a realistic setting. But how can that be done? Noriko Ishihara explains.]

How to do a scenario-based assessment of socializing skills. In my view, it’s very close to assessing sociolinguistic/pragmatic ability, which has usually been done with a situational approach.

In this instruction and assessment, learner language is elicited using realistic scenarios and the teacher chooses from a range of language- and culture-focused features to assess, for example,

– directness, politeness, and formality
– organization/discourse structure
– language form, semantic strategies, word choice
– tone (verbal and non-verbal cues)
– understanding and use of sociocultural norms
– the extent to which the speaker’s intentions match the listener’s most likely interpretation

The selected feature(s) can be assessed using various rubrics and/or checklists by the teacher and learners themselves, which can be used as rather formal assessment or part of everyday instruction/informal assessment. If anyone is interested, I’d be happy to share a paper in press that details this approach with various sample scenarios and sample assessments using authentic learner language.

By Jennifer Wallace – Anhui Gongye Daxue, Ma’anshan, China

Lots of us are trying to develop tests appropriate for the situations we’re teaching in. One document I’d recommend, because I’ve found it enormously helpful, is the Council of Europe Framework, which is on the Internet, as a downloadable pdf file (for which you need to have Adobe Acrobat Reader on your machine). I like the document for several reasons.

The work behind it is the work of a large number of experts across Europe, who’ve developed one framework to cover the teaching (and testing) of any of the languages taught and used in Europe – which of course includes a variety of non-European languages. In other words, the whole thing is language independent. I understand it to be very much a reflection of the most up to date understanding we have of measuring language performance. The particular document in question is the latest version, the result of many revisions.

The document addresses the fundamental questions in all this, and looks at every dimension conceivable – so I can use it as a basis for testing speaking, listening, reading, anything. It looks at things on general levels and on detailed specific levels – so you can home in on the level that is relevant for you at the moment.

Because this framework is as comprehensive as it is, it lets me think up a variety of activities for the form of my tests, activities that reflect the students’ experiences and what they’ve done in a course. But at the same time it’s kept me very much on track, enabling me to see clearly what level our target is.

Because it’s not language-specific, you can test yourself (there’s one section on self-testing) for your Chinese to see how this sort of approach works.

Someone also commented about examiners’ ability not to be swayed – well, I think what allows me to be more objective is using a number of scales and criteria when I test. For example, this semester my college end-of-first-year students will get some marks for pronunciation (because we’ve done quite a bit of pronunciation work in their Oral English classes), some marks for fluency, some marks for grammar, some marks for vocabulary/lexis and some marks for coherence.

I’m also thinking about including some marks for how they deal with problems – repair work, asking for help, paraphrasing, miming, using fillers to gain thinking time and to fill a silence, and suchlike – what’s called strategic competence. My criteria for vocab/lexis and grammar will not be whether they demonstrate use of anything in particular, but how effective they are at communicating successfully: do their errors interfere with communication, or hinder it, or render it impossible? This is because I teach college English majors – I think testing for specific aspects of these dimensions is the responsibility of other teachers in other classes. But at the same time, my students do realise that I consider grammar and lexis to be seriously important.

As regards a quick test, my experience, and the experience of others testing large numbers quickly for summer schools (in UK language schools), is that an informal chat of around 5 minutes, graded only on a 5-point scale (with “very easy to understand” scoring 5), is a remarkably effective tool in the hands of a native speaker. Even on the most mundane of topics (your home town, your family), it sorts the lower from the higher from the in-betweens. I did this at the beginning of this year with my 225 new students, and on subsequent reflection, having taught them now for 2 semesters, remarkably few of my initial assessments were wrong, and none were way off.

What’s interesting is looking back at their subsequent development! The value for me is how much respect I have for the students who got a low rating at the beginning and who would only now get a middle rating – but wow, what progress! In each band, I can see students who have really made big efforts and made progress, and I can also see students who’ve made almost no progress. Of those, a small number are not interested in the effort it entails (basketball etc. is more important), but I also have one or two who I realise are making efforts but little progress. I think that initial testing and placement has really helped me, and I plan to do it for future Oral English classes. One thing I did was use the test results to make groups according to level, and that’s been very successful as well.

Korean businesses are trusting TOEIC or TOEFL test scores less and realizing the importance of oral English interviews more. Read about it here.

By Eve Ross – Beijing Institute of Machinery, China

Exams are coming up around here, and I’ve decided that the test format for my Oral English classes will be a 10-minute one-on-one conversation between each student and myself. The criteria will be holistic: fluency and intelligibility.

When a student arrives to take the exam, I will present him/her with little slips of paper and on each one will be a topic we have discussed in class this term (shopping, sports, divorce, etc.). The student selects one topic at random, like drawing straws. After a moment’s reflection, the student must start the conversation with a question, such as “Do you like shopping?” or “What is your favorite sport?” or something a little more provocative, such as “Can divorce be a good thing?” and be ready for whatever answer I may give to that question.

The conversation must last 10 minutes, but can stray to other topics, if they come up naturally (as in, NOT “Let’s talk about sports now,” but maybe, “Do you think Michael Jordan likes to go shopping?”).

Obviously this is way more touchy-feely than my students are used to. But I think it will be a good measure of their ability to hold a conversation in English, as well as an experience that will give them confidence. I hope they leave the exam thinking, “Wow, I just talked for 10 minutes straight in English with a foreigner! And we understood each other!”

In classes to prepare my students for this kind of test, I’m using a variation of The Wheel (submitted by Rae). Students sit in a double circle, with those facing outward playing the teacher and those facing inward playing a student. I gave each “student” one of the exam topics on a slip of paper. They had 2 minutes to start the conversation with the “teacher” and have the “teacher” respond. Then the “students” handed the topic to the person on their right, and the “teachers” moved to the chair on their left. Thus everyone had a new partner and a new topic. They did the same role-play, except I gave them 3 minutes so they could elaborate a little more. We did several rotations, until they were talking for 10 minutes at a stretch.

If I noticed anyone not talking, I would go over to them and try to jumpstart their conversation. However, there was really a lot of English going on!