Education Software

Essay Grading Software For Teachers 535

asjk writes "Software to help teachers with grading has been around for some time, even for grading essays. A new tool, called Criteria, will look at grammar, usage, and even style and organization. It works by being trained on at least 450 essays scored by two professionals. The difference this time? Here is a snip from the article: '"There's a lot of skepticism," Dr. Spatola said. "The people opposed see it dehumanizing the student's papers, putting them through some sort of mechanical, computerized system like the multiple choice tests. That's really not the case, because we're not talking about eliminating the human element. We're making the process more efficient."'"
This discussion has been archived. No new comments can be posted.

  • The GMAT, a test required to get into business school in the US, includes two 30-minute essay questions. Your responses are graded by a human grader and a computer program on a scale of 0 to 6. Your score is then a composite of the two scores.

    ETS actually has a web site where you can do a sample essay that their server will grade for you.

    More info can be found here [mba.com].

  • by pr0ntab ( 632466 ) <pr0ntab AT gmail DOT com> on Sunday September 07, 2003 @12:01AM (#6891213) Journal
    It's like the Bayesian filter for mail classification in SpamBayes or Mozilla. In fact, that's probably where Criteria's programmers got their inspiration.

    If you read the article, you'll discover they had to feed it four hundred or so "good" papers (a training set), and they argue for its validity because graders notice that (paraphrased) "well written papers [on the topic] contain certain key words or ideas, and avoid certain expressions [examples]", which the system picks up on. Since it agrees with grader scores more than 95% of the time, I think those simple indicators are actually pretty useful.

    Keep in mind, it can give a perfect score to unreadable garbage, which isn't even grammatically correct. (This is mentioned in the article)

    Nice 5 insightful, though. But next time, read the article.

    In fact, I'm ashamed no one has mentioned yet that this is just like spam-filter technology; a rough sketch of the idea is below. Come on, Slashdot, is your technical insight on a weekend trip or what?
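    To put the analogy in concrete terms: the kind of word-frequency Bayesian classifier SpamBayes uses, retargeted from "spam vs. ham" to "high-scoring vs. low-scoring essays", only takes a few lines. This is just a sketch of the general idea, assuming Python with scikit-learn and made-up essays and labels; it is not ETS's actual system.

        # Sketch of the spam-filter analogy: a naive Bayes model over word counts,
        # trained on human-scored essays (hypothetical data, not ETS's algorithm).
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB

        training_essays = [
            "The evidence clearly supports the thesis because ...",  # scored well by humans
            "Furthermore, the author develops the argument by ...",  # scored well
            "stuff happened and it was bad i guess",                 # scored poorly
            "this essay is about the topic the topic is good",       # scored poorly
        ]
        human_labels = ["high", "high", "low", "low"]  # stand-in for the 400-odd professionally scored papers

        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(training_essays)

        model = MultinomialNB()
        model.fit(X, human_labels)

        new_essay = ["The author supports the thesis with clear evidence"]
        print(model.predict(vectorizer.transform(new_essay)))        # predicted class
        print(model.predict_proba(vectorizer.transform(new_essay)))  # class probabilities

    Like a spam filter, it keys on which words tend to show up in each class, which is exactly the "certain key words or ideas" behaviour the graders noticed, and exactly why unreadable garbage full of the right words can still score well.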
  • Okay, this is going to be rather long, so please bear with me.

    First off, let me say that I am involved in the automated essay grading industry and have helped to develop RocketScore [rocketreview.com], which does everything Criterion does, and lots more. Forgive me for the blatant plugs in this post; I'll try to keep them to a minimum.

    But let's move on to the focus of this article.

    First off, there is a lot of criticism that essay graders are formulaic, capable only of seeing patterns that arose in their originating sample set of essays. With Criterion, an offshoot of ETS's e-rater, this is a serious concern. When you only know what you have already seen, anything out of left field looks completely awry and cannot be graded appropriately. RocketScore is different; it uses a "features" method to check for included or excluded material, among many other things, and is therefore quite good at handling subtle writing and essay types it has never seen before.

    One of the great things about essay graders is that they give a student an objective standard to look to. Human graders grade differently based upon mood, the time they have to review the writing, and many other mitigating factors. In other words, the same human grader might grade the same essay differently at separate points in time. Most essay graders will always grade the same essay in the same manner. This is great for a student, for if a teacher gives you a D when the essay grader says it's in B range, one might be able to use this evidence to force the teacher to reconsider the grade. Or vice versa: if the essay grader is telling you that you're getting a D, you can work and improve on it until you're getting the B you'd be happy with.

    But there are serious drawbacks to the comments E-Rater and Criterion give. E-Rater's comments are based solely on your score (if you get a 1, you get comment set 1; if you get a 2, comment set 2; and so on). Criterion gives a student "instructional feedback in basic grammar, usage, style and organization." E-Rater's comments are inadequate at best, and Criterion's leave a lot to be desired. RocketScore provides substantial feedback on how to improve your writing: not just stylistic and grammatical comments, but comments on what you should be writing more about (you didn't provide enough info!), what you should be writing less about (you gave too much info!), and how to balance your arguments, among many other categories.

    There are two major problems with essay grading. The first is bullshit detection, and the second is determining whether the essay actually answered the question asked. E-Rater and Criterion both have real problems with these two criteria. For bullshit detection, RocketScore has thresholds which can be set and adjusted on the fly, from throwing out anything that isn't completely relevant to the topic to allowing just about any essay submitted. And you will get a score and comments based upon what you submitted. Of course, these are most helpful when you make a meaningful attempt to submit a relevant essay.

    "The machine score and the human score are in agreement 97 percent to 98 percent of the time."

    Yes, but do you know how ETS defines "agreement"? Glad you asked. When the grader's grade is within a point of the human's grade. Now, with the SAT 2 test, which is on a scale of 1 through 6, that means if the grader says 2, and a human says 1, 2, or 3, then there's agreement. But that's 50% of the scale! Their essay grader has a 98% chance of hitting the wall in front of them as opposed to the wall next to them. Woohoo. Meanwhile, RocketScore provides decimal point accuracy (we don't give you a 4 or a 5, we give you a 4.1, or 5.3), and is 98% accurate. But how do we define accurate? When the grader's grade is rounded to the nearest whole number, and that number is the human's grade. In other words, if we give you a 4.3, there is a 98% chance a human would give you a 4. With 4.5,
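    To make the difference concrete, here is a toy comparison of the two definitions, assuming Python and made-up machine/human score pairs on a 1-6 scale (not ETS or RocketScore data):

        # Toy comparison of "agreement" definitions on made-up (machine, human) score pairs.
        def within_one_point(machine: float, human: int) -> bool:
            # ETS-style "agreement": rounded machine score within one point of the human score
            return abs(round(machine) - human) <= 1

        def exact_after_rounding(machine: float, human: int) -> bool:
            # Stricter definition: machine score rounds to exactly the human score
            return round(machine) == human

        pairs = [(4.3, 4), (2.0, 1), (2.0, 3), (5.3, 4), (1.8, 2)]  # hypothetical scores

        for machine, human in pairs:
            print(machine, human,
                  "within one point:", within_one_point(machine, human),
                  "exact after rounding:", exact_after_rounding(machine, human))

    On a 1-6 scale, "within one point" of a mid-range score covers half the scale, so a 97-98% agreement figure under that definition is a much weaker claim than 98% exact agreement after rounding.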

  • by Anonymous Coward on Sunday September 07, 2003 @01:07AM (#6891406)
    Of course, ETS has yet to divulge the details of the technology they use.
    Could this be it?

    As I understand it, ETS poured tens of millions of dollars into the automated essay grading effort, in parallel with the development of the CBT format. For years after the CBT was introduced, the GRE essay was still done on paper.

    Once the grading software finally worked on the database of sample essays, the GRE essay switched from paper to word-processor entry (something many test takers have difficulty with).

    Still, the essays were graded by armies of grad students. Only when the automated grading matched the manual grading 90% of the time did the software "go live".

    IIRC, it has been in use for only a few years, after over a decade of development.

  • by ergo98 ( 9391 ) on Sunday September 07, 2003 @01:08AM (#6891409) Homepage Journal
    "I heard a statistic once that if you chose answers randomly on a MC test that you could get a C by not knowing anything beyond how to circle a letter!"

    You "heard a statistic once"? Geez, the probability statistics aren't that difficult: If there's 4 possible answers, and you randomly pick, you'll likely get about 25% right, or 5/20, 3/33. It isn't rocket science. To get 50% randomly there'd have to be only two possible choices. Add to that the fact that many post secondary multiple choice tests actually deduct marks for incorrect answers, and your C proclamation sounds like it might be incorrect.

  • Re:Interesting.. (Score:5, Informative)

    by dieman ( 4814 ) * on Sunday September 07, 2003 @01:33AM (#6891481) Homepage

    I took an old college paper [ringworld.org] that I wrote, plugged it into the program, and got 100% on everything except for creativity (99.973). Considering that I don't think I got a 'perfect' score on this paper, I'm really surprised by the scores. :)

    How great though, throwing a paper about the fear of technology through something many people (rightfully) fear. :)

  • by ahfoo ( 223186 ) on Sunday September 07, 2003 @02:16AM (#6891588) Journal
    I wrote in my journal about this a while back. ETS was trying to sell their essay grader to a group of the local test prep chains here in Taiwan. The local schools called me in to sit in on the presentation. Before I went in, I searched around and found numerous free and open implementations, and I asked the speaker why they were selling their academic software for so much money --it was a rather complex contract on a per-seat basis-- when there were similar products available for free. Their rep claimed to be unaware of any similar open-sourced products that could match the amazing and advanced artificial intelligence features they were offering. Sales reps --hmm. The mere posing of the question definitely made them stutter and squirm, though.
    But the interesting part was after I got home. I looked at ETS's own research monographs and found that internally this overpriced system had been debunked. It was discovered that by writing one well-formed short paragraph and then cutting and pasting it over and over, an almost perfect score could be attained. The more times it was pasted, the higher the score (a toy illustration of this failure mode appears at the end of this comment).
    It was also possible to write an essay on an unrelated topic and still get a high score, allowing students to rely on rote memorization of a single model essay. This, naturally, is impossible with a human reader, because a human can tell what the topic is fairly easily. According to the sales literature this software could too, but in actual tests that claim didn't hold up.
    Their sales literature claimed that the software contained artificial intelligence, and thus implied that such simple techniques would not fool it, but in practice this was far from the case.
    Monographs published by ETS also made it clear that despite their aggressive marketing of this product outside the US, they were not planning to use it as an exclusive grading system on their own tests. Rather, it was to be used as a teaching tool. However, it took a lot of digging to uncover that information.
    Just as with translation, there's a lot of financial motivation to make this technology work, but that doesn't necessarily translate into workable products. In the nineties, when spelling and grammar checking was already old hat and English/European translation was making such headway, I thought fluent Chinese/English translation was just a few years away. Now it's 2003, grammar checkers still only work if you write in a prescribed style, and I've yet to see anything halfway decent in Chinese/English translation software, although you still hear claims all the time for some overpriced product that's supposedly almost there.
    I think we'll see dramatic life extension long before we see decent computer essay graders. Decent trade as far as I'm concerned. As for translation, we can always teach more languages in school.
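    To illustrate the cut-and-paste failure mode described above: any scorer that rewards raw counts of "good" features scales with repetition. This is a caricature, assuming Python and an invented list of favored terms; it is not ETS's actual model.

        # Toy demo: a scorer that adds up occurrences of favored terms rewards cut-and-paste.
        import string

        FAVORED_TERMS = {"therefore", "evidence", "moreover", "thesis"}  # invented list

        def naive_score(essay: str) -> int:
            words = (w.strip(string.punctuation) for w in essay.lower().split())
            return sum(1 for w in words if w in FAVORED_TERMS)

        paragraph = ("The evidence therefore supports the thesis; "
                     "moreover, the evidence is consistent. ")

        print(naive_score(paragraph))        # score for one well-formed paragraph
        print(naive_score(paragraph * 10))   # the same paragraph pasted ten times scores ten times higher

    Repetition multiplies every signal such a scorer counts, which is exactly the behaviour the monographs described.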
  • Comment removed (Score:5, Informative)

    by account_deleted ( 4530225 ) on Sunday September 07, 2003 @09:41AM (#6892466)
    Comment removed based on user account deletion
