
Category Archives: testing

Below is a link to six speaking samples of two college students. What score would you give them?

6 samples rar

Please listen to each sample carefully and score each individually. They may have been made at different times.

For business topics the student is speaking from a prompt card that had a topic and the student had 3 minutes to prepare. The student had no access to any materials to prepare, only time to think and plan. For personal topics the answers were spontaneous to the questions.

Give one score per sample. If you want to use the IELTS scale, you can find the band descriptors here: IELTS Speaking Band Descriptors. Please say what scale you are using.

Generally, it is expected that students can speak about familiar topics, like family or friends, better than unfamiliar topics like business. So the difficulty of the task has to be considered in scoring.

Now you can try it: Can you rate speaking?

Please do not discuss your scores on the List until all of the scores have been published after the one month waiting period. If you have any questions or problems, please contact Dave Kees at: davekees[at]gmail.com.

Write your scores in the “Leave a comment” section on the left side of this page or click here: Comments. All score submissions will be withheld for one month and then published. This way submissions will not be influenced by previous submissions.

Your score:

What scale?:

Sample 1:
Sample 2:
Sample 3:
Sample 4:
Sample 5:
Sample 6:

(Special prize for the submissions that are closest to the average scores. The prize is Uncle Dave’s Tie Score Tie. Yes, now you can be the envy of your school and own one of these specially designed high-quality silk ties perfect for teachers who do oral English testing. While the student is talking, the teacher can adjust the gold tie clip up or down to indicate to the student how he is doing and as a reminder to the teacher of how the student performed! Note: This offer is void in countries outside of China and in all areas inside of China.)

James Hunter, Assistant Professor – English Language Center, Gonzaga University, Spokane USA

Excel has an Analysis ToolPak which can do a lot of statistical tasks. Help on installing it is here. Also, try the R Project.  This is a free “software environment for statistical computing and graphics” and it will run on Windows, Mac, and Linux.  I haven’t had much of a chance to play with it, but it is certainly not user-friendly.  However, you can also get Statistical Lab, which is a GUI interface for R, also free but not for Mac or Linux. There’s also a free version of SPSS (the “big” stats package that businesses & colleges use), called PSPP.

With all of these, you can easily do correlation matrices, t-tests, chi-square tests, item analysis, ANOVA, etc. These will enable you to compare results on assessments, do pre- and post-tests, get inter-rater reliability information, find links between variables, etc.  See also this for information on which statistical procedures to use when.

I use mean and SD on most tests and quizzes to a) compare classes to previous semesters and b) look at the distribution and spread of scores on a test/item. This helps to make informed decisions about assessment instruments, especially those that might be adopted as standardized tests for the program. I’ve done a lot of work with our placement instruments, for example, to determine reliability and check our cut scores.
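As a rough illustration of the mean-and-SD comparison described above, here is a minimal Python sketch using only the standard library. The scores and the names `last_semester`, `this_semester`, and `summarize` are invented for illustration; they are not taken from any real class.

```python
from statistics import mean, stdev

# Invented quiz scores (out of 100) for two semesters; not real data.
last_semester = [62, 70, 75, 58, 81, 69, 73, 66]
this_semester = [71, 78, 64, 85, 90, 59, 77, 72]

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    return mean(scores), stdev(scores)

for label, scores in [("last", last_semester), ("this", this_semester)]:
    m, sd = summarize(scores)
    print(f"{label} semester: mean={m:.1f}, SD={sd:.1f}, n={len(scores)}")
```

A similar mean with a much larger SD would suggest that the second group is more mixed in level, which is the kind of spread information the mean alone hides.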

Recently, I’ve been doing research on corrective feedback in oral production, so have needed measures of accuracy and fluency (and complexity!). Statistical analysis has been essential to find correlations between, say, accuracy and reaction time on a grammaticality test and accuracy and production time in a correction test.  For instance, in class a student says to another: *“Yeah, actually I’m agree with you”. This goes down on a worksheet for her (and occasionally other classmates – see this for a description of this methodology), and she is later given a timed test in which she sees the incorrect sentence and has to record a corrected version. Her speed in doing this task (plus her accuracy) gives a measure of whether this structure/lexis is part of her competence (or to use Krashen’s model, whether it has been “acquired” or “learned”: presumably, if this theory holds water, “learned” forms will take longer to process and produce than “acquired” ones). In addition to this production test, I’ve been doing a reaction-time test in which the same learner hears her own recording and has to decide, as quickly as possible, whether what she said is correct or not.  You can try this for yourself here (you will not be able to hear student recordings, only a few practice sets, recorded by me using student errors from our database; use anything as Username and “elc” as password).
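To make the idea of correlating accuracy with reaction time concrete, here is a small sketch of a Pearson correlation computed by hand in Python. The function name `pearson_r` and the per-student figures are assumptions made up for this example; the original analysis was done in SPSS and Excel, not with this code.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented per-student figures: mean reaction time (seconds), proportion correct
reaction_time = [1.2, 1.5, 0.9, 2.1, 1.8, 1.1]
accuracy = [0.85, 0.80, 0.92, 0.60, 0.70, 0.88]

print(f"r = {pearson_r(reaction_time, accuracy):.2f}")
```

A value near -1 in this toy data would mean that slower responses go with lower accuracy. Note that the function fails with a division by zero if either list has no variation at all.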

These measures yield thousands of results, and that’s why statistical analysis has been essential. Excel can do a lot of the work, especially in graphical representation, but SPSS has done most of the heavy lifting. For instance, it has revealed that there is no significant difference between the reaction time (or accuracy) when a student is listening to herself correcting an error she originally made and when she is listening to herself correcting errors made by classmates. In other words, students are just as good or bad at noticing and judging errors whether they made them or a classmate did. The same is true in the correction task described above.  This indicates that WHOSE error a student is correcting/judging has much less effect on her speed or accuracy than some other factor, e.g. the nature of the error itself. Probably a large “Duh!” factor there, but these things need to be ruled out before moving on…

By Peter Preston, Poland

Teachers do calculate the average score from tests, but then nothing serious is done with it. Even when the average score is close to the pass mark little statistical comment is made about the glaring problem that this represents. For example, if the average and the pass mark are the same and the population is normally distributed around the average, this means that 50% of the students fail. Can it be considered acceptable for 50% of the candidates to fail an end-of-the-year examination or even worse an end-of-the-course examination?
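The 50% figure can be checked with a short Python sketch, under the same assumption made above that scores are normally distributed. The function name `expected_fail_rate` and the numbers passed to it are hypothetical.

```python
from statistics import NormalDist

def expected_fail_rate(mean_score, sd, pass_mark):
    """Share of candidates scoring below pass_mark, assuming normal scores."""
    return NormalDist(mu=mean_score, sigma=sd).cdf(pass_mark)

# Average equal to the pass mark: half the candidates fall below it
print(expected_fail_rate(60, 10, 60))  # 0.5

# Average one SD above the pass mark: roughly 16% fall below it
print(round(expected_fail_rate(70, 10, 60), 2))
```

Real exam scores are often skewed rather than normal, so this is only a first approximation of what an average close to the pass mark implies.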

In fact at our college the last third-year UoE exam failed 80% of the students. Now you would think that a statistically-minded person would immediately start asking questions about validity of the exam. Construct validity – did the items set test the points intended to be tested? Course validity – did the items tested figure in the course syllabus? Is there a proper tie-up between the course syllabus and the test specifications (if the latter exist at all)? Did the distribution of correct responses discriminate between the weak and strong candidates? Were the items either too easy [not in this case] or too difficult? Is there any objective reference to competence standards built into the teaching programme? To ask just a few relevant questions.
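Two of the questions above (were items too easy or too difficult, and did they discriminate between weak and strong candidates) can be approximated with simple item statistics. The sketch below assumes each item is marked 1 for correct and 0 for incorrect, and uses an upper-group/lower-group comparison; the 27% default is a common convention, and the function name, data layout, and example candidates are all invented for illustration.

```python
def item_stats(responses, item, group_fraction=0.27):
    """
    responses: list of dicts, one per candidate, mapping item id -> 1 or 0.
    Returns (facility, discrimination) for one item.
    facility: proportion of all candidates answering correctly.
    discrimination: proportion correct in the top group minus the bottom group.
    """
    # Rank candidates by total score, strongest first
    ranked = sorted(responses, key=lambda r: sum(r.values()), reverse=True)
    n = max(1, round(len(ranked) * group_fraction))
    top, bottom = ranked[:n], ranked[-n:]
    facility = sum(r[item] for r in ranked) / len(ranked)
    p_top = sum(r[item] for r in top) / n
    p_bottom = sum(r[item] for r in bottom) / n
    return facility, p_top - p_bottom

# Four invented candidates and three items
candidates = [
    {"q1": 1, "q2": 1, "q3": 1},
    {"q1": 1, "q2": 1, "q3": 0},
    {"q1": 1, "q2": 0, "q3": 0},
    {"q1": 1, "q2": 0, "q3": 0},
]

for q in ("q1", "q2", "q3"):
    facility, discrimination = item_stats(candidates, q, group_fraction=0.5)
    print(q, facility, discrimination)
```

In this toy data, q1 is answered correctly by everyone and so tells us nothing about who is strong or weak, while q2 separates the top half from the bottom half completely.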

I would love to hear that other institutions do use statistical analysis of exam data and look at the variance between different exam sittings using the same exam or different ones, but I wonder if small institutes can ever bring together the required expertise to carry out such work either before the exam goes live or afterwards. It would be great to conduct a poll on this matter to try to assess the use of statistics in the analysis of exam data at as many institutes as possible.

Peter Preston's students in Poland

My own experience inclines me to believe that exams are in fact not so much an educational evaluation of the work being done as a policy instrument to give face validity to the programme. As such one does not need to worry about the quality of the exam since one can adjust the results before publication. Or in the case of my institute the exam can be repeated by order from above until the teachers get the message.

I do not like the cynical manipulation of exam data, so having good quality statistical information and quality control of all documents involved in the course would be the start to a reevaluation of the course and teaching methods. By accurate assessment at the beginning of a course it should be possible to predict the level students could get to after a given number of teaching hours, taking into account the realities of life. By keeping proper statistical records over a few years one would accumulate powerful information. This is what insurance companies do to calculate their premiums.

By Erlyn Baack, now retired, formerly at ITESM, Campus Queretaro, Mexico http://eslbee.com

Both the IELTS and the TOEFL are proficiency tests that measure overall proficiency. They are both global in nature. I do not think they should be seen as achievement tests to be used at the end of a semester of study. Instead, they may be used to inform the achievement rubrics that should be developed within successive levels within an English program. Likewise, these proficiency exams should not be used as placement exams either because there are better placement exams available. There is not a single question on the TOEFL, for example, that discriminates between the English ONE, TWO, and THREE levels. So for placement, even Michigan’s very old English Placement Test (if it is still available) would be better than the TOEFL.

That said, the IELTS and the TOEFL should inform the achievement (and the rubrics in each of the four skills, ideally) that teachers and/or course administrators want to achieve at each level within an English program. Teachers and/or course administrators have to decide the curriculum at each level: For example, in developing the curriculum for English ONE, teachers and/or course administrators must ask and answer the following questions: At the end of the semester, (1) What do we want the students to know (or achieve, or be, or be able to do)?, (2) How are we going to teach it?, and (3) How are we going to test it?

Teachers and/or administrators are then responsible for designing a curriculum and an ACHIEVEMENT exam, _with rubric_, that measures the level of student achievement throughout the semester. By definition, all students should have the ability to STUDY or PRACTICE the curriculum within the semester that would lead to higher achievement scores meaning there would be a high correlation between (1) the number of hours a student studies and (2) his/her final semester score. Those achievement scores, then, would affect the TOEFL and the IELTS only indirectly.

I think it is helpful to distinguish between various exams and what they measure.

(1) Placement exams contain questions at all levels to place students within an English program. Michigan’s EPT is an example.

(2) Proficiency exams measure overall proficiency. The IELTS and TOEFL are examples, and they are used by universities, generally, to determine whether proficiency is sufficient for university studies.

(3) Achievement exams measure the level of student achievement within a semester of study. A major monthly exam, a mid-semester exam, or a final exam are examples of those. Did the student “achieve” what was supposed to have been taught and learned within a given week or month or semester?

By Jennifer Wallace
Anhui Gongye Daxue, Ma’anshan, China

Lots of us are trying to develop tests appropriate for the situations we’re teaching in. One document I’d recommend, because I’ve found it enormously helpful, is the Council of Europe Framework, which is on the Internet, as a downloadable pdf file (for which you need to have Adobe Acrobat Reader on your machine). I like the document for several reasons.

The work behind it is the work of a large number of experts across Europe, who’ve developed one framework to cover the teaching (and testing) of any of the languages taught and used in Europe – which of course includes a variety of non-European languages. In other words, the whole thing is language independent. I understand it to be very much a reflection of the most up to date understanding we have of measuring language performance. The particular document in question is the latest version, the result of many revisions.

The document addresses the fundamental questions in all this, and looks at every dimension conceivable – so I can use it as a basis for testing speaking, listening, reading, anything. It looks at things on general levels and on detailed specific levels – so you can home in on the level that is relevant for you at the moment.

Because this framework is as comprehensive as it is, it’s let me think up a variety of activities for the form of my tests, activities that reflect the students’ experiences and what they’ve done in a course. But at the same time it’s kept me very much on track, enabling me to see clearly what level our target is.

Because it’s not language-specific, you can test yourself (there’s one section on self-testing) for your Chinese to see how this sort of approach works.

Someone also commented about examiners’ ability not to be swayed – well, I think what allows me to be more objective is using a number of scales and criteria when I test. For example, this semester my college end-of-first-year students will get some marks for pronunciation (because we’ve done quite a bit of pronunciation work in their Oral English classes), some marks for fluency, some marks for grammar, some marks for vocabulary/lexis and some marks for coherence. I’m also thinking about including some marks for how they deal with problems – repair work, asking for help, paraphrasing, miming, using fillers to gain thinking time and to fill a silence, and suchlike – what’s called strategic competence. My criteria for vocab/lexis and grammar will not be whether they demonstrate use of anything in particular, but how effective they are at communicating successfully – do their errors interfere with communication, or hinder it, or render it impossible! This is because I teach college English majors – I think testing for specific aspects of these dimensions is the responsibility of other teachers in other classes. But at the same time, my students do realise that I consider grammar and lexis to be seriously important.

As regards a quick test, my experience, and the experience of others testing large numbers quickly for summer schools (in UK language schools), is that an informal chat of around 5 minutes, graded only on a 5-point scale (with “very easy to understand” scoring 5), is a remarkably effective tool in the hands of a native speaker. Even on the most mundane of topics (your home town, your family), it sorts the lower from the higher from the in-betweens. I did this at the beginning of this year with my 225 new students, and on subsequent reflection, having taught them now for 2 semesters, remarkably few of my initial assessments were wrong, and none were way off. What’s interesting is looking back at their subsequent development! The value for me is how much respect I have for the students who got a low rating at the beginning who would only now get a middle rating – but wow, what progress! In each band, I can see students who have really made big efforts and made progress, and I can also see students who’ve made almost no progress. Of those, a small number are not interested in the effort it entails (basketball etc is more important), but I also have one or two who I realise are making efforts but little progress. I think that initial testing and placement has really helped me, and I plan to do it for future Oral English classes.

One thing I did was use the test results to make groups according to level, and that’s been very successful as well.

By George

To accurately test my students, I give them oral exams which are recorded on tape. These exams have two parts. The first part is Q&A covering things we have covered in class. They almost always have a memorized response for the basic questions. I tend to ignore these. I focus on their responses to the followup questions. For example, I’ve told them that we might discuss their grandparents, so I might ask

“Are your grandparents alive?” “How many children did they have?” “How many boys and how many girls?” “Do you know your aunts and uncles?” “O.K., let’s talk about your youngest aunt.” Here is where they begin to break down because they didn’t think to prepare for a discussion about their youngest aunt. I’ve also begun asking about a favorite middle-school teacher and then focusing on the teacher they liked the least. Once I’ve gotten to the real subject I’ll begin with the person’s name, age etc. and gradually lead to more complex questions. Then I start looking for syntactic, grammatical and vocabulary failure. In many cases the exam has ended in 2 or 3 minutes and some have gone as long as 30 or 40 minutes. In all cases I use subjects they are familiar with: family, school, friends and hometowns. If I knew more about sports I would dwell on that. I have been known to ask a student to explain what a mid-fielder, a striker or a goalie does if they play those positions in football, or the role of guards, the center or forwards in basketball. I’ve even asked guitar-playing students to explain how to play a particular song. In short, they give me a guitar lesson.

To test for middle school, determine what is grade appropriate and start from there.

Again, start simple and progress to the complex. At what level do they abandon an answer or the topic entirely? The second part is a short oral reading which incorporates most of the English phonemes. I sometimes give them samples to practice with but they get a new reading for the exam. They must read cold.

Also, I’ve just begun developing a set of reading passages that will begin at about fifth or sixth grade level for native speakers, using Flesch-Kincaid reading grade level (RGL) measures, and which become progressively more advanced. This way I can determine the level at which they begin to break down, identified by their rate of word abandonment. In the first year I will be mainly concerned with phonetic identification and production. As we progress, stress and intonation will become more of a factor.
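For anyone who wants to grade their own passages, the Flesch-Kincaid grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The Python sketch below applies that formula; the syllable counter is a crude vowel-group heuristic written for this example, not the method any particular tool uses, so results will differ a little from word-processor readability scores.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    # Crude heuristic: a final 'e' is usually silent, except in '-le' endings
    if w.endswith("e") and not w.endswith("le") and count > 1:
        count -= 1
    return max(1, count)

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of a passage (approximate syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

easy = "The cat sat on the mat."
harder = "Statistical analysis revealed considerable variation among institutions."
print(round(flesch_kincaid_grade(easy), 2))
print(round(flesch_kincaid_grade(harder), 2))
```

Very short, simple passages can score below zero, which is normal for this formula; what matters for building a graded set is that the passages come out in the intended order of difficulty.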

Oral exams can be quantified, but I don’t like using them as the basis for a grade. I tell the school that grades should be considered as a report of a student’s speaking level and how much they have improved. In my classes, the only ones who actually fail are those who only show up for exams and the rare film. Those who come to class but aren’t there count as absent. Our school weeds them out pretty quick. Last term eight of my students flunked out, including two who were pretty good English speakers. Six were expelled for cheating on Chinese teachers’ exams.

By Jennifer Wallace – Anhui Gongye Daxue, Ma’anshan City, Anhui Province, China

When I came to teach here, although I’d been a speaking test examiner for more than 10 years (for UCLES exams) I’d actually never had to set an oral English exam before. I’d taught always in situations where the students were either taking no exam or were working towards an external exam. So if I did have to set tests, they were very much on the mock-exam model. I’d never taught a modern language within a university/college setting where this was the student’s main subject (major). Although I had a short training specific to coming to this post in China, provided by the NGO who sponsor my post, I came with some sort of assumption that there would be a syllabus, there would be designated attainment targets (although not necessarily expressed in that way). Well, you all know the reality here. I was timetabled for first year Oral English classes who were provided with one of the ORAL ENGLISH WORKSHOP series of books. If anything, I found that was worse than arriving to nothing. It implied someone somewhere thought the content of this course book was what my students should be mastering.

Anyway, after a semester of muddling along and getting some sort of impression about what might be possible, I realised that the lowest of the UCLES EFL exams I’d been a speaking test examiner for was probably within the reach of everyone in the class. I’d been warned about the tradition of everyone in the class passing the exams. Remember I’d done those UCLES tests for years. I could remember the type of tasks set in the exam, and I produced a parallel. Those UCLES tests are taken in pairs, but I chose to give each student an individual exam – partly as a public relations exercise about oral exams within my department. I was interlocutor as well as assessor. So I recorded all the exams and marked them from the tapes. I was right in that all my students were capable of attaining that first level in the UCLES hierarchy, which means that in a grander scheme of things they had all achieved the Council of Europe Basic User level. The descriptors for this (in summary) are:

Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g. very personal and family information, shopping, local geography, environment).

Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters.

Can describe in simple terms aspects of his/her background, immediate environment and matters of immediate need.

At the end of the first semester all my students could do that – although a good number could only just do it with a very sympathetic interlocutor. Others walked through it. Which gave me a good spread of marks. And on that basis I decided to model the exam at the end of the second semester on the next level up (which I’d also done examining work for). By that stage my classes had included a fair amount of group work, and so the exam was done in small, randomly selected groups of 4, not including me. This year I’ve done lots more group work, but am actually planning to give one-to-one exams at this stage instead – partly for comparison.

So my decisions were based on a combination of what was within my own capabilities as well as the students’. I’m not an expert on language testing. My only teacher training is a CELTA, and in the past when I’ve had to devise and construct college tests it was done under the supervision of a very experienced head of department. But also I’m not into re-inventing the wheel. The Council of Europe stuff – which relates to ALL the languages taught in Europe (and that includes teaching non-European languages) – is the result of mega-input from experts over heaven knows how many years now. I feel I’d be deluding myself if I thought I could devise any better sort of structure to work within – so I’m using it. I do also like it – I find it clear and easy to get my head around.

By Erlyn Baack – ITESM, Campus Queretaro, Mexico

A teacher asked about “Listening/Speaking exams to assess low intermediate students if they are ready for high intermediate level and then high intermediate students if they are ready for advanced level of ESL. Putting an emphasis on better academic preparation, in Listening we stress listening to lectures and note-taking, and in Speaking, we stress academic vocabulary, grammar correctness, and presentation techniques. We consider (1) exam content, (2) testing efficiency considering the number of students, (3) a rubric for Listening/Speaking, (4) involvement of outside instructors to make testing more objective.”

The teacher asked a whole series of interrelated questions involving testing of 40 low intermediate and 80 high intermediate students, and while the questions were specifically about “Listening/Speaking” exams, the emphasis is much broader, including listening and note-taking and academic vocabulary, grammar correctness, and presentation techniques.

Those questions, therefore, are all-encompassing questions that must be asked early on in the semester or even before the semester begins.

Teachers at both levels and the school administration should ask themselves (1) What do we want the students at each level to know (or be able to do) when they finish the level, (2) how are we going to teach it, and (3) how are we going to test it? Essentially, therefore, teachers at both levels should be (or should have been) addressing these questions in their lesson plans and classroom activities since the beginning of the semester. If daily or weekly classroom activities are designed for students to achieve the successes desired by the time of the final exam, their probability of success is high.

It would seem to me that listening to lectures and note-taking would be the easiest to teach and test, although it takes collaboration among teachers and administrators to set it up. For a simple classroom activity, I would recommend going to a website like Living on Earth at http://loe.org , downloading some MP3 files of radio stories, playing them for the class while they take notes, and then testing with, for example, ten multiple-choice questions. (Parenthetically, it is EASY to get written permission from LOE to use their MP3s in this manner, even including putting their files on CDs for the reserve in the library or for free distribution.) After the quiz (or before in some instances) students at both levels should have the opportunity to discuss the stories because they are interesting, sometimes polemic, and they can generate a lot of discussion. The last radio show consisted of seven stories ranging from three minutes to 17 minutes, and finally, an additional benefit is that the text of these MP3 stories is always available as well.

Regarding testing, to be fair, academic vocabulary and grammar correctness can be taken from materials (like Living on Earth) that students have already been exposed to. The text can be used for that, and at the levels Inna Braginsky is asking about, students can always benefit from focused attention toward S-V agreement, pronouns, modals, simple, compound, and complex sentences, adjective clauses, parallel structure, etc. It is up to the teachers and administrators to address the most obvious weaknesses at each level and focus on a few specifics students can study, so when they take their multiple choice tests on these things, they can succeed. (No apologies for multiple choice tests here! 😉 )

The hardest part is the speaking exam because it requires an enormous amount of consensus among TRAINED teachers.

First, teachers must agree on a series of possible questions or topics, which may be announced beforehand or not, as they prefer. I think to be reliable the topic should come from topics students have already been exposed to, much like the Living on Earth topics suggested above (a dozen topics would not be too many for students to prepare for). After students are given three or four minutes to present a (1) SUMMARY or an (2) ARGUMENT for something or a (3) RATIONALE against something (for example), teachers should ask a pertinent follow-up question or two, and both teachers/administrators and students should at least feel that the speech topic was fair; it was a topic the student was at least exposed to and had the chance to study.

Then, teachers should assess the speaking abilities of the students (both the prepared and extemporaneous parts), and ideally they will have worked together long enough with the parts of their rubric to arrive at a consensus independently. If four teachers, for example, each assign the student four out of six points on an aspect of his speaking performance, they should be very proud of that!
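One simple way to see how close a group of raters is to that kind of consensus is to count how many pairs of raters gave the same score. The Python sketch below is only an illustration: the function name `pairwise_agreement`, the optional tolerance, and the ratings are invented, and formal reliability statistics (which correct for chance agreement) would need a stats package.

```python
from itertools import combinations

def pairwise_agreement(scores, tolerance=0):
    """Fraction of rater pairs whose scores differ by at most `tolerance`."""
    pairs = list(combinations(scores, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    agree = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agree / len(pairs)

# Invented ratings from four teachers on one criterion (scale 1-6)
unanimous = [4, 4, 4, 4]
mixed = [4, 4, 5, 3]

print(pairwise_agreement(unanimous))           # every pair agrees exactly
print(pairwise_agreement(mixed))               # exact agreement only
print(pairwise_agreement(mixed, tolerance=1))  # agreement within one point
```

Tracking this figure across several students and criteria would show which parts of the rubric the raters interpret differently and so need more discussion.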

I didn’t address bringing a bunch of non-ESL-trained teachers into the mix, but their opinions count! Spaces both formal and informal should be available for “mainstream” teachers to explain the “greatest weaknesses” of the ESL students, but I wouldn’t invite them to exams.

By Maria Spelleri – Manatee Community College, Florida, USA

One way to get a sense of structure with the evaluation of student oral production is by using a rubric. Here’s an example of a speaking rubric for an ESL program in a US elementary school system: RUBRIC and here’s a site with programs to help you develop a rubric: DEVELOP RUBRIC

To create a rubric for a speaking activity such as retelling a story, you need to break the activity down into its most basic elements. For example, speech is comprised of vocabulary, grammar, pronunciation/stress/intonation, logical meaning and order, purpose, and in the case of the story, an element of cohesion. For the specific task, you might want to also consider the accuracy of the retelling, the amount of detail included, number or length of pauses and inappropriate filler noises, etc. Then, for each category, set the possible performance/assessment levels, for example, “excellent”, “satisfactory”, and “needs improvement”. I prefer to work with a basic set of 3 as it is easier for me to break down a production into bad, so-so, and good instead of more subtle variations, although plenty of instructors use 4 or 5 categories.
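A rubric like the one described above can be kept as a small table of categories and levels and totalled automatically. The following Python sketch is one possible layout, not a standard: the category list is a shortened, hypothetical selection from the elements mentioned above, and the point values for the three levels are an assumption.

```python
# Three performance levels, as suggested above; point values are assumed
LEVELS = {"needs improvement": 1, "satisfactory": 2, "excellent": 3}

# Hypothetical categories for a story-retelling task
CATEGORIES = [
    "vocabulary",
    "grammar",
    "pronunciation",
    "order and cohesion",
    "accuracy of retelling",
]

def score_rubric(ratings):
    """Turn per-category level names into points, a total, and the maximum."""
    missing = [c for c in CATEGORIES if c not in ratings]
    if missing:
        raise ValueError(f"no rating given for: {', '.join(missing)}")
    points = {c: LEVELS[ratings[c]] for c in CATEGORIES}
    maximum = len(CATEGORIES) * max(LEVELS.values())
    return points, sum(points.values()), maximum

# Invented ratings for one student
ratings = {
    "vocabulary": "satisfactory",
    "grammar": "needs improvement",
    "pronunciation": "excellent",
    "order and cohesion": "satisfactory",
    "accuracy of retelling": "excellent",
}

points, total, maximum = score_rubric(ratings)
print(f"{total} out of {maximum}")
```

Keeping the per-category points alongside the total means the completed rubric can still be shown to the student category by category.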

If you do a Google search using key words like “ESL Speaking rubric”, you should find many ideas to help you create a rubric that will meet your needs.

By the way, I would suggest recording the assessment either audio or audio/video because it can be hard to listen to content, mentally evaluate, and complete a rubric at the same time. Replaying the audio gives you time to better analyze the students’ work and assess more fairly. Playing back the recording for the student who can then watch him or herself and compare the recording to the completed rubric assessment is a valuable learning tool as well.

By Nik Bramblett – UCF, Orlando FL, USA

Sometimes we need to evaluate L2 socialization skills using an alternative assessment and not a paper test.

Here’s what I would do:

(a) Work with students (using appropriate combination of whole group, breakout small-group, and/or individual/paired strategies) to develop a rubric for a role-playing activity. Discuss what “socialization skills” means and how you might measure mastery of them. Let the students decide what’s important and what they will be graded on (with appropriate guidance from you as necessary, of course).

(b) Have students work in pairs or trios for the assessment… students would randomly select a social problem-solving situation from a collection that you created on cards or whatever… for example, “You need to make an important call [make up a specific scenario] and your cell phone is dead; there are two strangers nearby [perhaps it’s a bus stop or whatever]. Interact with those people to solve your problem.” Students would have a brief period to plan/rehearse, and would then more-or-less improv a scene.

(c) Both you and the student audience would use the rubric you designed together (and reviewed clearly and modeled and practiced before these presentations began) to measure the ability of the students to perform whatever specific tasks, roles, etc. you had decided were the measurable objectives. Students’ ability to effectively judge their peers’ performance would (rightly) be part of the grade. This would not only measure the mastery of the skills but also the metacognition behind the skills.