Education Software

Essay Grading Software For Teachers

Posted by timothy
from the sounds-better-than-my-12th-grade-teacher dept.
asjk writes "Software to help teachers with grading has been around for some time. This is true even for grading essays. A new tool, called Criteria, will look at grammar, usage, and even style and organization. It works by being trained on at least 450 essays scored by two professionals. The difference this time? Here is a snip from the article: '"There's a lot of skepticism," Dr. Spatola said. "The people opposed see it dehumanizing the student's papers, putting them through some sort of mechanical, computerized system like the multiple choice tests. That's really not the case, because we're not talking about eliminating the human element. We're making the process more efficient."'"


  • The GMAT, a test required to get into business school in the US, includes two 30-minute essay questions. Your responses are graded by a human grader and a computer program on a scale of 0 to 6. Your score is then a composite of the two scores.

    ETS actually has a web site where you can write a sample essay that their server will grade for you.

    More info can be found here [mba.com].
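
    For what it's worth, a minimal sketch of that composite, assuming it is just the average of the human and machine readings (the post above doesn't say exactly how the two are combined, so treat the formula as an assumption):

      def composite_essay_score(human: float, machine: float) -> float:
          """Combine the human and machine readings on the 0-6 scale.

          Assumption: a plain average; the real GMAT formula may round or
          weight the two readings differently.
          """
          return (human + machine) / 2.0

      # e.g. a human 5 and a machine 4 give a composite of 4.5
      print(composite_essay_score(5, 4))  # 4.5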

  • It's like the Bayesian filter for mail classification in SpamBayes or Mozilla. In fact, that's probably where Criteria's programmers got their inspiration.

    If you read the article, you'll discover they had to feed it four hundred or so "good" papers (a training set), and they describe its validity by noting that graders observe (paraphrased) that "well written papers [on the topic] contain certain key words or ideas, and avoid certain expressions [examples]", which the system picks up on. Since it agrees with grader scores more than 95% of the time, I think those simple indicators are actually pretty useful (see the sketch at the end of this comment).

    Keep in mind, it can give a perfect score to unreadable garbage, which isn't even grammatically correct. (This is mentioned in the article)

    Nice 5 insightful, though. But next time, read the article.

    In fact, I'm ashamed no one has mentioned yet that this is just like spam-filter technology. Come on, Slashdot, is your technical insight on a weekend trip or what?
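
    Since nobody sketched it: here, loosely, is what a spam-filter-style classifier for essays looks like. This is a generic naive Bayes toy, not Criterion's actual model; the training texts and the high/low labels are invented, standing in for the ~400 pre-scored essays mentioned above.

      import math
      from collections import Counter

      def train(essays_by_label):
          """essays_by_label: {"high": [text, ...], "low": [text, ...]}.
          Returns per-class word counts, the vocabulary, and class priors."""
          counts = {label: Counter() for label in essays_by_label}
          priors = {}
          total_docs = sum(len(docs) for docs in essays_by_label.values())
          for label, docs in essays_by_label.items():
              priors[label] = len(docs) / total_docs
              for doc in docs:
                  counts[label].update(doc.lower().split())
          vocab = set().union(*(c.keys() for c in counts.values()))
          return counts, vocab, priors

      def classify(text, counts, vocab, priors):
          """Pick the label with the highest log-probability (Laplace smoothing)."""
          scores = {}
          for label, prior in priors.items():
              total = sum(counts[label].values())
              logp = math.log(prior)
              for word in text.lower().split():
                  logp += math.log((counts[label][word] + 1) / (total + len(vocab)))
              scores[label] = logp
          return max(scores, key=scores.get)

      # Toy training set standing in for the pre-scored essays
      training = {
          "high": ["the author develops a clear thesis with supporting evidence",
                   "a clear argument supported by relevant evidence and examples"],
          "low":  ["this essay is good because it is good and stuff",
                   "i think things are true because they are true"],
      }
      counts, vocab, priors = train(training)
      print(classify("the thesis is supported by clear evidence", counts, vocab, priors))  # "high"
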
  • by AntiFreeze (31247) * <antifreeze42&gmail,com> on Sunday September 07, 2003 @12:25AM (#6891291) Homepage Journal
    Okay, this is going to be rather long, so please bear with me.

    First off, let me say that I am involved in the automated essay grading industry, and have helped to develop RocketScore [rocketreview.com], which does everything Criterion does, and lots more. Forgive me for the blatant plugs in this post; I'll try to keep them to a minimum.

    But let's move on to the focus of this article.

    To start, there is a lot of criticism of essay graders as being formulaic, only capable of seeing patterns that arose in their original sample set of essays. With Criterion, an offshoot of ETS's e-rater, this is a serious concern. When you only look at what you've already seen, anything out of left field looks completely awry and cannot be graded appropriately. RocketScore is different; RocketScore uses a "features" method to check for included or excluded material, among many other things, and is therefore quite good at noticing subtle writing and essay types it has never seen before.

    One of the great things about essay graders is that they give a student an objective standard to measure against. Human graders grade differently based upon mood, the time they have to review the writing, and many other mitigating factors. In other words, the same human grader might grade the same essay differently at separate points in time. Most essay graders will always grade the same essay in the same manner. This is great for a student: if a teacher gives you a D when the essay grader says it's in B range, you might be able to use this evidence to get the teacher to reconsider the grade. Or vice versa: if the essay grader is telling you that you're getting a D, you can work and improve on it until you're getting that B you'd be happy with.

    But there are serious drawbacks to the comments E-Rater and Criterion give. E-Rater gives comments based solely on your score (if you get a 1, you get comment set 1; if you get a 2, comment set 2; etc.). Criterion gives a student "instructional feedback in basic grammar, usage, style and organization." E-Rater's comments are inadequate at best, and Criterion's leave a lot to be desired. RocketScore provides substantial feedback on how to improve your writing: not just stylistic and grammatical comments, but comments on what you should be writing more about (you didn't provide enough info!), what you should be writing less about (you gave too much info!), and how to balance your arguments, among many other categories.

    There are two major problems with essay grading. The first is bullshit detection, and the second is determining whether the essay actually answered the question asked. E-rater and Criterion both have real problems with these two criteria. For bullshit detection, RocketScore has thresholds which can be set and manipulated on the fly, from throwing out anything that isn't completely relevant to the topic to allowing just about any essay submitted. Either way, you get a score and comments based on what you submitted; of course, these are most helpful when you make a meaningful attempt to submit a relevant essay.
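
    The parent doesn't say how RocketScore's thresholds are implemented, so the following is a generic illustration only: a relevance gate can be as simple as a term-overlap similarity against the prompt with an adjustable cutoff. All names and numbers here are made up.

      import math
      from collections import Counter

      def cosine_relevance(essay: str, prompt: str) -> float:
          """Cosine similarity between term-frequency vectors of essay and prompt."""
          a, b = Counter(essay.lower().split()), Counter(prompt.lower().split())
          dot = sum(a[w] * b[w] for w in set(a) & set(b))
          norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
          return dot / norm if norm else 0.0

      def accept_essay(essay: str, prompt: str, threshold: float) -> bool:
          """threshold near 1.0: only tightly on-topic essays pass;
          threshold 0.0: allow just about any essay submitted."""
          return cosine_relevance(essay, prompt) >= threshold

      prompt = "should schools require students to wear uniforms"
      on_topic = "students should not have to wear uniforms because uniforms limit expression"
      off_topic = "my favorite recipe for banana bread uses ripe bananas"
      print(accept_essay(on_topic, prompt, threshold=0.2))   # True
      print(accept_essay(off_topic, prompt, threshold=0.2))  # False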

    "The machine score and the human score are in agreement 97 percent to 98 percent of the time."

    Yes, but do you know how ETS defines "agreement"? Glad you asked. When the grader's grade is within a point of the human's grade. Now, with the SAT 2 test, which is on a scale of 1 through 6, that means if the grader says 2, and a human says 1, 2, or 3, then there's agreement. But that's 50% of the scale! Their essay grader has a 98% chance of hitting the wall in front of them as opposed to the wall next to them. Woohoo. Meanwhile, RocketScore provides decimal point accuracy (we don't give you a 4 or a 5, we give you a 4.1, or 5.3), and is 98% accurate. But how do we define accurate? When the grader's grade is rounded to the nearest whole number, and that number is the human's grade. In other words, if we give you a 4.3, there is a 98% chance a human would give you a 4. With 4.5,
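
    To make that point concrete, here is a quick toy comparison of the two definitions being argued over: "agreement" as being within one point of the human reader, versus an exact match after rounding. The paired scores are invented; only the definitions come from the comment above.

      def within_one_point(machine: float, human: int) -> bool:
          """ETS-style "agreement": the two readings differ by at most one point."""
          return abs(round(machine) - human) <= 1

      def exact_after_rounding(machine: float, human: int) -> bool:
          """Stricter standard: the machine score, rounded to the nearest
          whole number, equals the human score."""
          return round(machine) == human

      # Hypothetical (machine, human) score pairs on a 1-6 scale
      pairs = [(2.0, 1), (2.0, 3), (4.3, 4), (5.3, 4), (4.8, 5)]

      agreement = sum(within_one_point(m, h) for m, h in pairs) / len(pairs)
      exact = sum(exact_after_rounding(m, h) for m, h in pairs) / len(pairs)
      print(f"within one point:     {agreement:.0%}")  # 100%
      print(f"exact after rounding: {exact:.0%}")      # 40%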

  • by Anonymous Coward on Sunday September 07, 2003 @01:07AM (#6891406)
    Of course, ETS has yet to divulge the details of the technology they use.
    Could this be it?

    As I understand it, ETS poured tens of millions of dollars into the automated essay grading effort, in parallel with the development of the CBT format. For years after the CBT was introduced, the GRE essay was still done on paper.

    Once the grading software finally worked on the database of sample essays, the GRE essay switched from paper to word-processing entry (something many test takers have difficulty with).

    Still, the essays were graded by armies of grad students. Only when the automated grading matched the manual grading 90% of the time did the software "go live".

    IIRC, it has been in use for only a few years, after more than a decade of development.

  • by ergo98 (9391) on Sunday September 07, 2003 @01:08AM (#6891409) Homepage Journal
    "I heard a statistic once that if you chose answers randomly on a MC test that you could get a C by not knowing anything beyond how to circle a letter!"

    You "heard a statistic once"? Geez, the probability isn't that difficult: if there are 4 possible answers and you pick randomly, you'll likely get about 25% right, i.e. roughly 5 out of 20 questions. It isn't rocket science. To get 50% randomly there'd have to be only two possible choices. Add to that the fact that many post-secondary multiple-choice tests actually deduct marks for incorrect answers, and your C proclamation sounds like it might be incorrect.
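
    For anyone who wants the arithmetic spelled out, here is the expected score from blind guessing, with and without a wrong-answer penalty. The 5-choice, minus-quarter-point penalty is the classic SAT-style convention, used here purely as an example of deducting marks for incorrect answers.

      def expected_score(num_questions: int, num_choices: int, wrong_penalty: float = 0.0) -> float:
          """Expected raw score when every answer is a blind guess.

          P(correct) = 1/num_choices; each wrong answer subtracts wrong_penalty.
          """
          p = 1.0 / num_choices
          return num_questions * (p - (1.0 - p) * wrong_penalty)

      # 20 questions, 4 choices, no penalty: about 5 right (25%), nowhere near a C
      print(expected_score(20, 4))                      # 5.0
      # Classic guessing penalty (5 choices, -1/4 point per wrong answer) is
      # chosen so that blind guessing is worth nothing on average
      print(expected_score(20, 5, wrong_penalty=0.25))  # 0.0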

  • Re:Interesting.. (Score:5, Informative)

    by dieman (4814) * on Sunday September 07, 2003 @01:33AM (#6891481) Homepage

    I took an old college paper [ringworld.org] that I wrote and plugged it into the program, and got 100% on everything except for creativity (99.973). Considering that I don't think I got a 'perfect' score on this paper, I'm really surprised by the scores. :)

    How great though, throwing a paper about the fear of technology through something many people (rightfully) fear. :)

  • by ahfoo (223186) on Sunday September 07, 2003 @02:16AM (#6891588) Journal
    I wrote in my journal about this a while back. ETS was trying to sell their essay grader to a group of the local test-prep chains here in Taiwan. The local schools called me in to sit in on the presentation. Before I went in, I searched around and found numerous free and open implementations, and I asked the speaker why they were selling their academic software for so much money --it was a rather complex contract on a per-seat basis-- when there were similar products available for free. Their rep claimed to be unaware of any similar open-sourced products that could match the amazing and advanced artificial intelligence features they were offering. Sales reps --hmm. The mere posing of the question definitely made them stutter and squirm, though.
    But the interesting part was after I got home. I looked at ETS's own research monographs and found that internally this overpriced system had been debunked. It was discovered that by writing one well-formed short paragraph and then cutting and pasting it over and over, an almost perfect score could be attained. The more times it was pasted, the higher the score.
    It was also possible to write an essay on an unrelated topic and still get a high score, allowing students to rely on rote memorization of a single model essay. This, naturally, is impossible with a human reader, because a human can tell what the topic is fairly easily. According to the sales literature the software could too, but in actual tests that claim didn't hold up.
    Their sales literature claimed that the software contained artificial intelligence, and thus implied that such simple techniques would not fool it, but in practice this was far from the case.
    Monographs published by ETS also made it clear that despite their aggressive marketing of this product outside the US, they were not planning to use it as an exclusive grading system on their own tests. Rather, it was to be used as a teaching tool. However, it took a lot of digging to uncover that information.
    Just as with translation, there's a lot of financial motivation to make this technology work, but that doesn't necessarily translate into workable products. In the nineties, when spelling and grammar checking was already old hat and English/Euro translation was making such headway, I thought fluent Chinese/English translation was just a few years away. Now it's 2003, grammar checkers still only work if you write in a prescribed style, and I've yet to see anything halfway decent in Chinese/English translation software, although you still hear claims all the time for some overpriced product that's really almost there.
    I think we'll see dramatic life extension long before we see decent computer essay graders. Decent trade as far as I'm concerned. As for translation, we can always teach more languages in school.
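
    The copy-and-paste result above is easy to believe once you see how crude surface features behave under repetition. The toy scorer below is not ETS's system and its weights are invented; it just illustrates why any feature that grows with length or keyword counts is inflated by pasting the same good paragraph over and over.

      def toy_surface_score(essay: str, topic_words: set) -> float:
          """Score from crude surface features only: length plus topic-word hits,
          each capped and mapped onto a 0-6 scale. Purely illustrative."""
          words = essay.lower().split()
          length_score = min(len(words) / 300.0, 1.0) * 3.0   # up to 3 points for length
          topic_hits = sum(1 for w in words if w in topic_words)
          topic_score = min(topic_hits / 10.0, 1.0) * 3.0     # up to 3 points for topic words
          return length_score + topic_score

      topic = {"technology", "fear", "society", "change"}
      paragraph = ("Fear of technology reflects a deeper anxiety about social change, "
                   "and society negotiates that fear through its institutions. ")

      print(toy_surface_score(paragraph, topic))       # one paragraph: a modest score
      print(toy_surface_score(paragraph * 15, topic))  # pasted 15 times: near the ceiling
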
  • Re:Interesting.. (Score:5, Informative)

    by clifyt (11768) <(moc.liamg) (ta) (rettamkinos)> on Sunday September 07, 2003 @09:41AM (#6892466) Homepage
    Read what the model is about before complaining :)

    The model that is up there is based on impromptu entering-student essays.

    For this model, we gave students one hour to write an essay on a prompt they had no prior knowledge of. We allowed no research or even simple things like spell checking (we did provide hard-copy dictionaries :-)

    As such, anything that was well researched would probably have thrown this thing off the charts.

    We *DO* have several other models available. The best example of this technology was taken off the site a few weeks ago at the behest of a former partner in this research at Duke University. We DID have several models that could have been compared including one that was appropriate for many types of research papers.

    Remember -- folks are afraid this stuff is going to take away humanity, *BUT* no one wants to even think that this stuff is customizable for target groups. With as few as 300 rated papers (notice I try to NEVER say graded...though even after 10 years at this stuff it's hard not to...) we could set up initial models for an individual school system with their own rubrics, scored according to their skill levels. Of course, the model would HAVE to be refined for later usage, but that's enough to get started.

    The great thing about this is that at a production level, we actually screen for essays that are rated much higher or much lower than the standard deviations would allow for. That lets us take a look at what's going on and make adjustments.

    It also allows for diagnostic use by educators. For instance, my incoming students all have to write essays when they come in (unless they have taken an honors-level writing course in high school and have received college credit). This is all automated (on another system farther behind my line of defenses, ya hackers :-) in that they come in, we give them a prompt to write about, and they type it in (or, if they are afraid of computers, write it in a blue book...we ain't nazis about this technology -- but that will take 3 weeks longer, as our raters don't stop by campus too often). It's then transmitted to the student databases, and we've provided an interface for the English faculty to rate these things.

    *IF* the paper is written at a much higher threshold than is expected for a student of that calibre, I automatically kick off an email to the rater in charge of the honors program asking her to take a look at it. If it's much lower, the application tries to make a good first judgement as to whether this is a remedial case (which most of mine show up as :-) or an ESL case (English as a Second Language), and then we kick off the appropriate emails.

    This *ALSO* happens with human raters...the first rater to look at the essay has the choice of throwing it one way or another (actually she can alert ALL of the parties if necessary), and it does the same thing...but the automated part saves a few days of this initial interaction.

    Just as a note: if someone has gotten this far in the college application process, we aren't here to make any judgements on their ability to be a college student; we are interested in making the most appropriate assessment of where they should be placed to get the best help, so that they can have the best college experience around. This application was a good help in making sure that was achieved.

    We stopped using this in production a while back, after protests from folks who didn't know how it worked nor cared to understand that it wasn't out to take their jobs. It was there to help make sure that a SINGLE judgement on the human side was correct (or within a certain scope of correctness), and if not, ask that someone else give it a second look. Back in the day, three raters would have rated any given essay for student placement purposes, but even before this was introduced, it got to the point where depending on the attitudes of those rati
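
    A minimal sketch of the screening and routing described above. Every threshold, cutoff, and category name here is invented for illustration; the actual rules and rubrics are the institution's own.

      from statistics import mean, stdev

      def flag_outliers(scores, num_sds=1.5):
          """Return indices of essays rated more than num_sds standard deviations
          from the mean, so a human can give them a second look."""
          mu, sd = mean(scores), stdev(scores)
          return [i for i, s in enumerate(scores) if abs(s - mu) > num_sds * sd]

      def route_placement(score, expected):
          """Route a new student's essay by how far it sits from the expected level."""
          if score >= expected + 1.5:
              return "notify the honors-program rater"
          if score <= expected - 1.5:
              return "flag for remedial/ESL review"
          return "standard placement"

      ratings = [3.8, 4.1, 4.0, 4.2, 1.0, 3.9, 5.9]
      print(flag_outliers(ratings))              # [4]: the 1.0 sits far below the rest
      print(route_placement(5.6, expected=4.0))  # notify the honors-program rater
      print(route_placement(2.1, expected=4.0))  # flag for remedial/ESL review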

