michaelt
6 hours ago
> We surveyed students before releasing grades to capture their experience. [...] Only 13% preferred the AI oral format. 57% wanted traditional written exams. [...] 83% of students found the oral exam framework more stressful than a written exam.
[...]
> Take-home exams are dead. Reverting to pen-and-paper exams in the classroom feels like a regression.
Yeah, not sure the conclusion of the article really matches the data.
Students were invited to talk to an AI. They did so, and having done so they expressed a clear preference for written exams - which can be taken under exam conditions to prevent cheating, something universities have hundreds of years of experience doing.
I know some universities started using the square wheel of online assessment during covid and I can see how this octagonal wheel seems good if you've only ever seen a square wheel. But they'd be even better off with a circular wheel, which really doesn't need re-inventing.
BoiledCabbage
6 hours ago
That's what's so surprising to me - the data clearly shows the experiment had terrible results, yet the write-up is nothing but the author declaring: "glowing success!".
And they didn't even bother to test the most important thing: were the LLM evaluations even accurate? Have graders manually evaluate the exams and see whether the LLMs were close or wildly off.
This is clearly someone who had a conclusion to promote regardless of what the data was going to show.
knallfrosch
5 hours ago
I found "well, the LLMs converge when given each other's scores, so they agree and are correct" to be quite the jump to a conclusion.
cvoss
6 hours ago
The quote you gave is not the conclusion of the article. It's a self-evident claim that just as well could have been the first sentence of the article ("take-home exams are dead"), followed by an opinion ("reverting ... feels like a regression") which motivated the experiment.
Some universities and professors have tried to move to a take-home exam format, which allows for more comprehensive evaluation with easier logistics than a too-brief in-class exam or an hours-long outside-of-class sitting where unreasonable expectations for mental and sometimes physical stamina are factors. That "take-home exams are dead" is self-evident, not a result of the experiment in the article. There used to be only a limited number of ways to cheat at a take-home exam, and most of them involved finding a second person who also lacked a moral conscience. Now, it's trivial to cheat at a take-home exam all by yourself.
You also mentioned the hundreds of years of experience universities have at traditional written exams. But the type and manner of knowledge and skills that must be tested for vary dramatically by discipline, and the discipline in question (computer science / software engineering) is still new enough that we can't really say we've matured the art of examining for it.
Lastly, I'll just say that student preference is hardly the way to measure the quality of an exam, or much of anything about education.
michaelt
6 hours ago
> The quote you gave is not the conclusion of the article.
Did I say "conclusion"? Sorry, I should have said the section just before the acknowledgements, where the conclusion would normally be, entitled "The bigger point".
Nifty3929
4 hours ago
I think this is the actual conclusion: "Now, AI is making them scalable again."
That is, the author concluded that AI tools provide viable alternatives to the other available options, ones that solve many of their problems.
NewsaHackO
6 hours ago
The issue is that it is not scalable, unless there is some dependable, automated way to convert handwriting to text.
pgalvin
6 hours ago
University exams being marked by hand, by someone experienced enough to work outside a rigid marking scheme, has been the standard for hundreds of years and has proven scalable enough. If there are so many students that academics can’t keep up, there are likely too many students to maintain a high standard of education anyway.
unbrice
5 hours ago
> there are likely too many students to maintain a high standard of education anyway.
Right on point. I find it particularly striking how little is said about whether the best students achieve the best grades. The authors are even candid that different LLMs assess differently, but they seem to conclude that LLMs converging after a few rounds of cross-review makes the scores plausible, so who cares. Appearances are safe.
aaplok
4 hours ago
A limitation of written exams shows up in distance education, which was hardly a thing for the hundreds of years exams were used. Just as WFH is a new practice employers have to learn to deal with, study from home (SFH) is a phenomenon that is going to affect education.
The objections to SFH exist and are strikingly similar to objections to WFH, but the economics are different. Some universities already see value in offering that option, and they (of course) leave it to the faculty to deal with the consequences.
sarchertech
3 hours ago
Distance education is a tiny percentage of higher education though. Online classes at a local university are more common, but you can still bring the students in for proctored exams.
Even for distance education though, proctored testing centers have been around longer than the internet.
Kwpolska
6 hours ago
Why is this a problem now, but was not a problem for the past few centuries? This class had 36 students, you could grade that in a single evening.
aleph_minus_one
an hour ago
> Why is this a problem now, but was not a problem for the past few centuries? This class had 36 students, you could grade that in a single evening.
At least in Germany, if there are only 36 students in a class, oral exams are usually used, because at that size they are typically more efficient. Written exams are the norm for classes of more like 200-600 students.
NewsaHackO
5 hours ago
I agree with you and the other posters actually, but I think the efficiency gap compared with typed work is the reason adoption is so slow. Another thing to remember is that there is always a mild Jevons paradox at play: while it's true that it was possible in previous centuries, expectations on teachers have also increased, which strains the time they have for grading handwritten work.
recursivecaveat
3 hours ago
It is literally perfect linear scaling. For every student you must expend a constant number of minutes of TA time grading the exam. Why is it unconscionable for a university to have an expense that scales at the same rate as its tuition revenue? $90,000 of tuition pays for a lot of grading hours. I feel that scalability is a cultural meme that has lost the plot.
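Back-of-envelope, with made-up numbers (the grading time and TA wage below are my assumptions, not from the article):

    students = 36          # class size from the article
    min_per_exam = 20      # assumed TA grading time per exam, in minutes
    ta_wage = 25.0         # assumed TA wage, USD per hour

    cost = students * min_per_exam / 60 * ta_wage
    print(f"${cost:.0f} to grade the whole class")  # -> $300

Even if those guesses are off by an order of magnitude, it's a rounding error next to the tuition.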
andrepd
2 hours ago
There are phrases that HN loves, and "scalable" is one of them. Here, it is particularly inappropriate.
Some people dream that technology (preferably duly packaged by for-profit SV concerns) can and will eventually solve each and every problem in the world; unfortunately, what education boils down to is good, old-fashioned teaching. By teachers. Nothing whatsoever replaces a good, talented, and attentive teacher; all the technologies in the world, from planetariums to manim, can only augment one.
Grading students with LLMs is already tone-deaf, but presenting this trainwreck of a result and framing it as any sort of success... Let's just say it reeks of 2025.
gwern
an hour ago
To clarify the point here for people who didn't read OP: the oral exams here are customized and tailored to each student's unique individual project; that's the point, and why they are not written:
> In our new "AI/ML Product Management" class, the "pre-case" submissions (short assignments meant to prepare students for class discussion) were looking suspiciously good. Not "strong student" good. More like "this reads like a McKinsey memo that went through three rounds of editing," good...Many students who had submitted thoughtful, well-structured work could not explain basic choices in their own submission after two follow-up questions. Some could not participate at all...Oral exams are a natural response. They force real-time reasoning, application to novel prompts, and defense of actual decisions. The problem? Oral exams are a logistical nightmare. You cannot run them for a large class without turning the final exam period into a month-long hostage situation.
Written exams do not do the same thing. You can't say 'just do a written exam'. So sure, the students may prefer them, but so what? That's apples and oranges.
Aurornis
6 hours ago
When college degrees cost as much as they do, it's reasonable to pay people to do the transcription and/or grading.
Work study and TA jobs were abundant when I was in college. It wasn't a problem in the past and shouldn't be a problem now.
vkou
5 hours ago
I assure you, oral exams are completely scalable. But it does require most of a university's budget to go towards labs and faculty, and not administration and sports arenas and social services and vanity projects and three-star dorms.
musicale
5 hours ago
One way of scaling out interactive/oral assessment (and personalized instruction in general) is to hire a group of course assistants/tutors from the previous cohort.
vkou
4 hours ago
So, TAs. The other half of the mission-critical staff that keeps a university running.
musicale
4 hours ago
I think it works differently at different schools and in different countries, but hourly (often undergraduate work-study) course assistants in the US can be very affordable since they typically still pay tuition and are paid at a lower rate than fully funded (usually graduate student) TAs.
andrepd
2 hours ago
> sports arenas and social services and vanity projects and three-star dorms
One of these is not like the others.
vasco
5 hours ago
One student had to talk to an AI for more than 60 minutes. These guys are creating a dystopia. Also students will just have an AI pick up the phone if this gets used for more than 2 semesters.
WJW
4 hours ago
Regular exams definitely take more than a single hour though. How is this bad?
j_w
5 hours ago
It's not that the oral format should be dismissed, just that the idea of your exam being a conversation with a machine that judges the merit of your time in a course is dystopian. Talking to another human is fine.
makeitdouble
4 hours ago
How different is it, in essence, from checking boxes that are scanned by a machine and auto-evaluated into a one-dimensional numerical score?
Have exams ever been about humanity and the optics of it ?
sarchertech
3 hours ago
Very different. A scantron machine is deterministic and non-chaotic.
In addition to being non-deterministic, LLMs can produce vastly different output from very slightly different input.
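For contrast, the entire "grading function" of a scantron is something like this one-liner (a sketch, obviously):

    # Deterministic and non-chaotic: the same bubbles always yield the same
    # score, and changing one bubble changes the score by at most one point.
    def scantron_score(answers: str, key: str) -> int:
        return sum(a == k for a, k in zip(answers, key))

    scantron_score("ABCDA", "ABCDB")  # -> 4, every run, forever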
That’s ignoring how vulnerable LLMs are to prompt injection, and if this becomes common enough that exams aren’t thoroughly vetted by humans, I expect prompt attacks to become common.
Also, if this is about avoiding in-person exams, what prevents students from just letting their own AI talk to the test AI?
makeitdouble
3 hours ago
I saw this piece as the start of an experiment, and the use of a "council of AI", as they put it, to average out the variability sounds like a decent path to standardization to me (prompt injection would not be impossible, but getting something past all the steps sounds like a pretty tough challenge).
They mention getting 100% agreement between the LLMs on some questions and lower rates on others, so if an exam were composed only of questions where there is near-100% convergence, we'd be pretty close to a stable state.
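A sketch of what such a council-with-convergence-check could look like (the model names, the grade() stub, and the ten-point scale are mine, not the article's setup):

    from statistics import mean

    MODELS = ["model_a", "model_b", "model_c"]  # placeholder names

    def grade(model: str, transcript: str, rubric: str) -> int:
        """Stub for an LLM call; a real version would hit the model's API."""
        return 7  # placeholder score out of an assumed 10

    def council_grade(transcript: str, rubric: str) -> float | None:
        # Each model scores independently, never seeing the others' scores,
        # so agreement is evidence rather than echo.
        scores = [grade(m, transcript, rubric) for m in MODELS]
        if max(scores) - min(scores) <= 1:  # assumed tolerance: one point
            return mean(scores)
        return None  # no convergence: escalate to a human grader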
I agree it would be reassuring to have a human somewhere in the loop, or perhaps to allow students to appeal the evaluation (at cost?) if there is evidence of a disconnect between the exam and the other criteria. But depending on how the questions and format are tweaked, we could IMHO end up with something reliable for very basic assessments.
PS:
> Also if this is about avoiding in person exams, what prevents students from just letting their AI talk to test AI.
Nothing indeed. The arms race is only getting started here, and will keep going IMO.
sarchertech
2 hours ago
> Nothing indeed.
So the whole thing is a complete waste of time then as an evaluation exercise.
>council of AIs
This only works if the errors and idiosyncrasies of different models are independent, which isn't likely to be the case (toy simulation at the end of this comment).
>100% agreement
When different models independently graded tests, 0% of grades matched exactly and the average disagreement was huge.
They only reached convergence on some questions when they allowed the AIs to deliberate. This is essentially just context poisoning.
One model incorrectly grading a question will make the other models more likely to grade that question incorrectly.
If you don't let models see each other's assessments, all it takes is one student phrasing an answer in a slightly different way that causes disagreement among the models, and the overall scores shift drastically because a question gets tossed out.
This is not even close to something you want to use to make consequential decisions.
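To illustrate the independence point, a toy simulation (the 20% per-model error rate is an assumption for illustration):

    import random

    def majority_errs(p: float, correlated: bool) -> bool:
        """One trial: does a 3-model majority vote misgrade a question?"""
        if correlated:
            return random.random() < p  # shared failure mode: all err together
        errs = sum(random.random() < p for _ in range(3))
        return errs >= 2                # independent errors: majority must err

    N = 100_000
    for corr in (False, True):
        rate = sum(majority_errs(0.2, corr) for _ in range(N)) / N
        print("correlated" if corr else "independent", round(rate, 3))

    # independent -> ~0.10, correlated -> ~0.20: a council only beats a single
    # model to the extent the models' errors are actually independent.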
Eisenstein
2 hours ago
A technological solution to a human problem is a lure we have fallen for too many times these last few decades.
Humans are incredibly good at solving problems, but while one person is solving 'how do we prevent students from cheating', a student is thinking 'how do I bypass this limitation preventing me from cheating?'. And when these problems are digital and scalable, it only takes one student to solve that problem for every other student to have access to the solution.
reincarnate0x14
2 hours ago
A Fire Upon the Deep coming to your classroom!