
Monthly Archives: March 2011

James Hunter, Assistant Professor – English Language Center, Gonzaga University, Spokane USA

Excel has an Analysis ToolPak that can handle many statistical tasks; help on installing it is here. Also try the R Project, a free “software environment for statistical computing and graphics” that runs on Windows, Mac, and Linux. I haven’t had much of a chance to play with it, but it is certainly not user-friendly. However, you can also get Statistical Lab, a free GUI for R, though it is not available for Mac or Linux. There’s also a free alternative to SPSS (the “big” stats package that businesses and colleges use), called PSPP.

With all of these, you can easily produce correlation matrices, t-tests, chi-square tests, item analyses, ANOVAs, and so on. These will enable you to compare results on assessments, run pre- and post-tests, get inter-rater reliability information, find links between variables, etc. See also this for information on which statistical procedures to use when.
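As a toy illustration of the kind of calculation any of these packages will do, here is a paired t-test on pre- and post-test scores, computed by hand in Python; all the numbers are invented:

```python
import math
import statistics

# Invented pre- and post-test scores for the same eight students
pre  = [62, 70, 55, 68, 74, 59, 66, 71]
post = [68, 75, 60, 72, 80, 63, 70, 78]

# Paired t-statistic by hand: t = mean(d) / (sd(d) / sqrt(n))
diffs = [b - a for a, b in zip(pre, post)]
n = len(diffs)
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
print(f"paired t = {t:.2f} (df = {n - 1})")
```

A dedicated package adds the p-value and checks of the test's assumptions, but the arithmetic underneath is no more than this.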

I use mean and SD on most tests and quizzes to (a) compare classes to previous semesters and (b) look at the distribution and spread of scores on a test/item. This helps me make informed decisions about assessment instruments, especially those that might be adopted as standardized tests for the program. I’ve done a lot of work with our placement instruments, for example, to determine reliability and check our cut scores.
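A sketch of that kind of descriptive check, with invented quiz scores for two semesters of the same course:

```python
import statistics

# Invented quiz scores for the same course in two semesters
spring = [78, 85, 62, 90, 74, 81, 69, 88]
fall   = [72, 80, 58, 85, 70, 77, 64, 83]

for name, scores in [("spring", spring), ("fall", fall)]:
    print(f"{name}: mean = {statistics.mean(scores):.1f}, "
          f"SD = {statistics.stdev(scores):.1f}")
```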

Recently, I’ve been doing research on corrective feedback in oral production, so I have needed measures of accuracy and fluency (and complexity!). Statistical analysis has been essential for finding correlations between, say, accuracy and reaction time on a grammaticality test, or accuracy and production time in a correction test. For instance, in class a student says to another: *”Yeah, actually I’m agree with you”. This goes down on a worksheet for her (and occasionally for other classmates – see this for a description of the methodology), and she is later given a timed test in which she sees the incorrect sentence and has to record a corrected version. Her speed on this task (plus her accuracy) gives a measure of whether this structure/lexis is part of her competence (or, to use Krashen’s model, whether it has been “acquired” or “learned”: presumably, if this theory holds water, “learned” forms will take longer to process and produce than “acquired” ones). In addition to this production test, I’ve been running a reaction-time test in which the same learner hears her own recording and has to decide, as quickly as possible, whether what she said is correct. You can try this for yourself here (you will not be able to hear student recordings, only a few practice sets, recorded by me using student errors from our database; use anything as Username and “elc” as password).

These measures yield thousands of results, which is why statistical analysis has been essential. Excel can do a lot of the work, especially in graphical representation, but SPSS has done most of the heavy lifting. For instance, it has revealed that there is no significant difference in reaction time (or accuracy) between a student listening to herself correcting an error she originally made and listening to herself correcting errors made by classmates. In other words, students are just as good or bad at noticing and judging errors whether they made them or a classmate did. The same is true in the correction task described above. This indicates that WHOSE error a student is correcting/judging has much less effect on her speed or accuracy than some other factor, e.g. the nature of the error itself. Probably a large “Duh!” factor there, but these things need to be ruled out before moving on…
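The own-versus-classmates comparison could be run as an independent-samples (Welch’s) t-test; a minimal sketch with invented reaction times:

```python
import math
import statistics

# Invented reaction times (ms): judging one's own errors vs. classmates' errors
own    = [1450, 1620, 1380, 1510, 1700, 1480, 1590, 1530]
others = [1490, 1580, 1420, 1550, 1660, 1500, 1610, 1470]

# Welch's t-statistic: t = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2)
m1, m2 = statistics.mean(own), statistics.mean(others)
v1, v2 = statistics.variance(own), statistics.variance(others)
t = (m1 - m2) / math.sqrt(v1 / len(own) + v2 / len(others))
print(f"Welch t = {t:.2f}")  # close to zero, i.e. no significant difference here
```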


By Peter Preston, Poland

Teachers do calculate the average score from tests, but then nothing serious is done with it. Even when the average score is close to the pass mark, little statistical comment is made about the glaring problem this represents. For example, if the average equals the pass mark and the population is normally distributed around the average, then 50% of the students fail. Can it be considered acceptable for 50% of the candidates to fail an end-of-year examination, or even worse an end-of-course examination?
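That arithmetic is easy to check. Assuming, purely for illustration, normally distributed scores with a mean of 60 and an SD of 10:

```python
from statistics import NormalDist

# Illustrative assumption: scores normally distributed, mean 60, SD 10
scores = NormalDist(mu=60, sigma=10)

# With the pass mark equal to the mean, half the cohort fails by construction
fail_rate = scores.cdf(60)          # P(score < pass mark)
print(f"{fail_rate:.0%} fail when the pass mark equals the mean")

# Setting the pass mark one SD below the mean changes the picture
print(f"{scores.cdf(50):.0%} fail with a pass mark of 50")
```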

In fact, at our college the last third-year UoE exam failed 80% of the students. You would think that a statistically minded person would immediately start asking questions about the validity of the exam. Construct validity: did the items test the points they were intended to test? Course validity: did the items tested figure in the course syllabus? Is there a proper tie-up between the course syllabus and the test specifications (if the latter exist at all)? Did the distribution of correct responses discriminate between weak and strong candidates? Were the items too easy [not in this case] or too difficult? Is there any objective reference to competence standards built into the teaching programme? To ask just a few relevant questions.
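The difficulty and discrimination questions can be answered with two classic item-analysis statistics: the proportion of candidates answering an item correctly, and (in one simple version) the difference in that proportion between the top and bottom halves of the candidates. A sketch with invented 0/1 item responses:

```python
# Invented 0/1 item responses, one row per candidate, sorted from strongest
# to weakest by total score; three items per candidate.
responses = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
n = len(responses)
top, bottom = responses[: n // 2], responses[n // 2:]

for item in range(3):
    # Difficulty: proportion of all candidates answering correctly
    p_all = sum(row[item] for row in responses) / n
    # Discrimination: top-half correct rate minus bottom-half correct rate
    disc = (sum(row[item] for row in top)
            - sum(row[item] for row in bottom)) / (n // 2)
    print(f"item {item + 1}: difficulty = {p_all:.2f}, discrimination = {disc:+.2f}")
```

In this invented data, item 3 has a discrimination of zero, meaning strong and weak candidates do equally well on it, which is exactly the kind of item these questions are meant to flag.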

I would love to hear that other institutions do use statistical analysis of exam data and look at the variance between different exam sittings, using the same exam or different ones, but I wonder whether small institutes can ever bring together the required expertise to carry out such work, either before the exam goes live or afterwards. It would be great to conduct a poll on this matter, to try to assess the use of statistics in the analysis of exam data at as many institutes as possible.

[Photo: Peter Preston's students in Poland]

My own experience inclines me to believe that exams are in fact not so much an educational evaluation of the work being done as a policy instrument to give face validity to the programme. As such, one does not need to worry about the quality of the exam, since one can adjust the results before publication. Or, in the case of my institute, the exam can be repeated by order from above until the teachers get the message.

I do not like the cynical manipulation of exam data, so having good-quality statistical information and quality control of all documents involved in the course would be the start of a reevaluation of the course and teaching methods. With accurate assessment at the beginning of a course, it should be possible to predict the level students could reach after a given number of teaching hours, taking into account the realities of life. By keeping proper statistical records over a few years, one would accumulate powerful information. This is what insurance companies do to calculate their premiums.
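The kind of prediction described here is, at its simplest, a least-squares regression of final score on teaching hours; a sketch with invented figures:

```python
# Invented records: teaching hours received vs. end-of-course score
hours  = [40, 60, 80, 100, 120, 140]
scores = [52, 58, 63, 70, 74, 81]

# Ordinary least squares fit by hand: slope = sum((x-mx)(y-my)) / sum((x-mx)^2)
mx = sum(hours) / len(hours)
my = sum(scores) / len(scores)
slope = sum((x - mx) * (y - my) for x, y in zip(hours, scores)) \
      / sum((x - mx) ** 2 for x in hours)
intercept = my - slope * mx
print(f"score ≈ {intercept:.1f} + {slope:.3f} × hours")
print(f"predicted score after 90 hours ≈ {intercept + slope * 90:.1f}")
```

With a few years of real records, the same fit (with more predictors) is exactly the actuarial exercise the insurance comparison suggests.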

By Amy Shipley, Academy of Art University, San Francisco, USA

We use a rubric based on our course learning outcomes for all our writing and speaking assignments. I give students the rubric when I assign the task. I also put students in groups and give them two things: the rubric itself and a blank rubric. I have them paraphrase the requirements in each grading category so they fully understand what I’m looking for.

For speaking tasks, I videotape all formal presentations, so I have a record of what they’ve done. But it’s also for the students to evaluate (and grade) themselves. During the presentation, I also assign students in the audience to specific speakers, and have them evaluate and give feedback to those speakers at the end of the presentation. When students grade themselves (which I check after I’ve graded them), I get feedback on my grading. It helps me to know whether they understand the criteria and whether I explained them well.
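One simple way to quantify how well self-grades track teacher grades is a correlation between the two sets of rubric scores; a sketch with invented grades:

```python
import math

# Invented rubric totals (out of 20): teacher's grade vs. the student's self-grade
teacher = [16, 14, 18, 12, 15, 17, 13, 19]
student = [17, 13, 18, 14, 15, 18, 12, 20]

# Pearson correlation by hand as a rough index of agreement
n = len(teacher)
mt, ms = sum(teacher) / n, sum(student) / n
num = sum((t - mt) * (s - ms) for t, s in zip(teacher, student))
den = math.sqrt(sum((t - mt) ** 2 for t in teacher)
                * sum((s - ms) ** 2 for s in student))
r = num / den
print(f"teacher vs. self-grade r = {r:.2f}")
```

A high r says the rankings agree; it does not catch a class that consistently over- or under-grades itself, which the mean difference between the two columns would show.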

By R. Michael Medley, Ph.D., Professor of English, Eastern Mennonite University

Two ways that I use the Academic Word List are as follows, the assumption being that this is some sort of English language development class for those who need English for academic purposes:

1. If the students are doing a reading which contains many unfamiliar words (but the reading is interesting to the students and helps them learn about something they want to learn about), I might use the AWL to identify which words in the passage are most worth the students’ concentrated attention. We all know that some words are of such low frequency that it is not worthwhile for learners to spend time working to incorporate them into their active (or even passive) vocabulary.

But if some of the new words in the passage are on the AWL, then I can devise some kind of exercise or discussion that brings those words into focus and gives learners (a) additional multiple exposures to the words and (b) actual practice using them.

2. I am in the process of writing some ESOL materials based mainly on readings representing a unified content area. I regularly use a vocabulary profiler, LexTutor, to help me see the relative frequencies of the words that make up a passage. This profiler also identifies AWL words. So if I am trying to simplify the text a little, I can do so by changing the “off-list” words, that is, the words of quite low frequency which are not on the AWL. I will certainly leave the AWL words in the text so that the students get exposed to them. Since most of the texts in my materials will be read by high-intermediate or advanced students with instructor support (and not independently, as extensive reading), I feel it is adequate if 90% of the vocabulary falls within the top 2000 words of English (usually that means 80–85% of the words are in the top 1000). The remaining 10% will be AWL and low-frequency words.
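The coverage calculation a profiler performs can be sketched in a few lines; the word lists and text below are tiny invented stand-ins, not the real frequency or AWL lists:

```python
# Tiny invented stand-ins for the real top-2000 and AWL word lists
top_2000 = {"the", "students", "read", "a", "text", "and", "answer", "questions",
            "about", "it", "then", "discuss", "in", "groups"}
awl = {"analyze", "data", "interpret"}

text = "the students read a text then analyze the data and interpret it in groups"
tokens = text.split()

# Count tokens falling into each band; everything else is "off-list"
in_top2000 = sum(1 for w in tokens if w in top_2000)
in_awl = sum(1 for w in tokens if w in awl)
off_list = len(tokens) - in_top2000 - in_awl

print(f"top-2000 coverage: {in_top2000 / len(tokens):.0%}")
print(f"AWL tokens: {in_awl}, off-list tokens: {off_list}")
```

A real profiler also lemmatizes (so “reads” counts as “read”) and works from the published lists, but the band-counting logic is the same.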

A teacher who uses a lot of electronic texts with her/his learners could easily use this vocabulary profiler to check the presence of AWL words in the readings, in effect letting the vocabulary profiles guide both the choice of readings and the selection of vocabulary to bring into focus before or after the reading.

An interesting realization I’ve had in preparing these materials is that there is a lot of specialized vocabulary for the particular subject area I’m dealing with. Now that I am working on chapter 12, the low-frequency vocabulary for one reading seems to have grown very large. But when I look carefully at the words, I see right away that many of them have been introduced already and practiced many times through the previous 11 chapters. This illustrates the value of extended reading (not exactly the same as extensive reading); that is, reading a lot in one subject area, or becoming accustomed to the writing style (patterns of thought and expression) of one author.

By Erlyn Baack, now retired, formerly at ITESM, Campus Queretaro, Mexico http://eslbee.com

Both the IELTS and the TOEFL are proficiency tests that measure overall proficiency; they are global in nature. I do not think they should be treated as achievement tests to be used at the end of a semester of study. Instead, they may be used to inform the achievement rubrics that should be developed for successive levels within an English program. Likewise, these proficiency exams should not be used as placement exams, because better placement exams are available. There is not a single question on the TOEFL, for example, that discriminates between English ONE, TWO, and THREE levels. So even Michigan’s very old English Placement Test (if it is still available) would be better than the TOEFL for placement.

That said, the IELTS and the TOEFL should inform the achievement targets (and, ideally, the rubrics in each of the four skills) that teachers and/or course administrators set at each level within an English program. Teachers and/or course administrators have to decide the curriculum at each level. For example, in developing the curriculum for English ONE, they must ask and answer the following questions: At the end of the semester, (1) what do we want the students to know (or achieve, or be, or be able to do)? (2) How are we going to teach it? And (3) how are we going to test it?

Teachers and/or administrators are then responsible for designing a curriculum and an ACHIEVEMENT exam, _with rubric_, that measures the level of student achievement throughout the semester. By definition, all students should have the ability to STUDY or PRACTICE the curriculum within the semester in ways that would lead to higher achievement scores, meaning there would be a high correlation between (1) the number of hours a student studies and (2) his/her final semester score. Those achievement scores, then, would affect the TOEFL and the IELTS only indirectly.

I think it is helpful to distinguish between various exams and what they measure.

(1) Placement exams contain questions at all levels to place students within an English program. Michigan’s EPT is an example.

(2) Proficiency exams measure overall proficiency. The IELTS and TOEFL are examples, and they are used by universities, generally, to determine whether proficiency is sufficient for university studies.

(3) Achievement exams measure the level of student achievement within a semester of study. A major monthly exam, a mid-semester exam, or a final exam are examples of those. Did the student “achieve” what was supposed to have been taught and learned within a given week or month or semester?