I designed a system like this years ago (at the other large company doing essay scoring as part of their big college entrance exam)...and part of the methodology was that the human raters were never supposed to know what score the computer had given the paper. If the computer and the human were within 1 point of each other (on a 6-point scale...pretty much within one standard deviation), the essay was assigned the average of the two scores. If not, it went to two other humans, and the score was averaged.
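For what it's worth, the adjudication logic was about as simple as it sounds. Here's a minimal sketch of the flow in Python; the names, and my choice to average only the two adjudicating humans' scores, are my reconstruction, not the production code:

    # Sketch of the computer/human adjudication flow described above.
    # Names and the exact averaging rule are assumptions on my part.
    def final_score(computer, human, request_human_rating, threshold=1):
        """Scores are on a 6-point scale; a 1-point gap was roughly one SD."""
        if abs(computer - human) <= threshold:
            # Computer and first human agree closely: average the two.
            return (computer + human) / 2
        # Otherwise, escalate to two more human raters and average them.
        second = request_human_rating()
        third = request_human_rating()
        return (second + third) / 2

    # Example: computer says 4, human says 2, so two more humans break the tie.
    print(final_score(4, 2, lambda: 3))  # 3.0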
The biggest thing we had to work on with the raters was that they were NOT to rate content. They were to rate writing. This was an uphill battle, because raters wanted their opinions built into the score, and after rating 5,000 papers on specific subjects, they felt like experts in the field. I had to tell them over and over that content was not to be rated. It could be gibberish...most of these essays were designed so a student could come in blind, pick a subject (or choose between a few), and write without knowing a damn thing about any of the subjects at hand. And yet here they were, being rated on their knowledge by these people.
I know this was the instruction most ignored by the Pearson team competing with us (this world is a small, small world...most of us would talk about how our systems worked, and we were confident enough that ours worked better that we didn't care what we said to each other).
So yeah, as long as it was well-written gibberish, the computer would rate it higher.
That said, I don't know of any team intending computers to be the sole arbiter of the grade. Augmented rating, yes...sole rater? No.
And I know that within my system, the computer agreed with the group mean 70%+ of the time (on a 6-point scale...Pearson used a 4-point scale and bragged about the same agreement rates), and close to 90% of the time it was within one standard deviation. If you took any single rater from a 4-person team and put them against the mean? You'd realize the computer was more accurate than the humans.
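Those agreement numbers are easy to tally if you have a pile of team-scored essays. A quick sketch (the data here is invented purely for illustration; "exact agreement" means the computer's score matches the rounded group mean):

    from statistics import mean, stdev

    # (computer score, [human scores from a 4-person team]) per essay, 6-point scale.
    essays = [(4, [4, 5, 4, 3]), (2, [3, 2, 2, 4]), (5, [5, 5, 6, 5])]

    exact = within_sd = 0
    for computer, humans in essays:
        m, sd = mean(humans), stdev(humans)
        if round(m) == computer:
            exact += 1
        if abs(computer - m) <= sd:
            within_sd += 1

    print(f"exact agreement with group mean: {exact / len(essays):.0%}")
    print(f"within one SD of group mean:     {within_sd / len(essays):.0%}")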
In the end, I wanted to use our system to help educators assign more writing: students could submit their work to test it against the rater and revise against it. We also had a way for students to ask our experts for help with their writing if they didn't like the ratings (and we had a much-expanded scale for non-rating purposes...the last model I trained used the 6+1 system -- http://educationnorthwest.org/resource/949 -- to produce not just an overall score but subscores, in addition to all the boring spelling, grammar, and all that bullshit). It allowed students to write more, and to write in a more self-directed manner. It also allowed the teacher to grade with a scalpel as opposed to a hatchet and fix the writing before it was turned in for the final grade.
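For concreteness, the 6+1 traits in that model are ideas, organization, voice, word choice, sentence fluency, and conventions, plus presentation as the "+1". A sketch of what a per-essay subscore record might look like (field names and the rollup are my own choices, not anything from our codebase):

    from dataclasses import dataclass, fields

    @dataclass
    class TraitScores:
        # The six traits, plus presentation as the "+1".
        ideas: float
        organization: float
        voice: float
        word_choice: float
        sentence_fluency: float
        conventions: float
        presentation: float

        def overall(self) -> float:
            """One plausible rollup: a simple mean of all subscores."""
            vals = [getattr(self, f.name) for f in fields(self)]
            return sum(vals) / len(vals)

A breakdown like that is what makes the scalpel possible: the teacher sees which trait is weak instead of just one number.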
Back to the parent poster's message...your wife probably wasn't supposed to be rating content. Then again, over the years I found entire teams operating under their own rules that had nothing to do with the carefully crafted directions I created...and I would imagine other large assessment teams at other orgs found the same (especially when you're using these systems in production as opposed to a sterile lab setting).