Looking at the linked scoring prompt (resume_evaluation_criteria.jinja) [0], I immediately see several red flags that suggest the output won't be reliable. (I'm developing an LLM-intensive application where the stakes are high enough that I need the LLM output to be reasonably correct.)
[0] https://github.com/interviewstreet/hiring-agent/blob/main/pr...
In no particular order:
1. The prompt is trying to get the system to do all of the evaluation steps at once. Instead, the system should break down the task of resume evaluation into its subcomponents and have separate prompts for each component. Like "evaluating open source contributions" should be its own task. Same with "assessing the complexity of software projects on the resume." Fwiw, each of the tasks contained within the prompt is woefully underspecified.
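Rough sketch of what I mean by splitting it up. The sub-task names, the prompt wording, and the `ask_llm` callable are all made up for illustration; plug in whatever client you actually use.

```python
from typing import Callable

# Hypothetical sub-tasks: each one gets its own narrowly scoped prompt
# instead of sharing one giant "evaluate everything" prompt.
SUBTASK_PROMPTS = {
    "open_source": (
        "List every open source contribution mentioned in the resume below. "
        "Quote the resume text for each one.\n\n{resume}"
    ),
    "project_complexity": (
        "For each software project in the resume below, list the technologies "
        "named and the stated scale (users, data size, team size).\n\n{resume}"
    ),
}


def evaluate_resume(resume_text: str, ask_llm: Callable[[str], str]) -> dict[str, str]:
    """Run each sub-task as a separate LLM call and collect the raw answers."""
    return {
        task: ask_llm(template.format(resume=resume_text))
        for task, template in SUBTASK_PROMPTS.items()
    }
```

Each answer can then be logged, reviewed, and scored on its own, which is hard to do when everything comes back as one blob.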
2. The prompt leaves spreads of ~10 points up to the LLM, when it's doubtful that humans are that well calibrated. Take for example:
> SCORING CRITERIA Open Source (0-35 points)
> HIGH SCORES (25-35 points):
> - Contributions to popular open source projects (1000+ stars)
> - Significant contributions to well-known projects
> - Google Summer of Code (GSoC) participation
> - Substantial community involvement
Are all of these 35-point examples? Is one a 26-point example? If not, what's the difference? If an expert can't reliably make the judgement, the LLM is going to struggle too. One partial fix is to get rid of the ranges and just say all of these are worth 30 points. An additive point scheme would be better...
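To make the additive idea concrete, here's a minimal sketch. The rubric items and point values are invented by me, not taken from the repo; the only number borrowed from the prompt is the 35-point cap. The LLM's job shrinks to answering yes/no per item, and plain code does the arithmetic.

```python
# Hypothetical rubric: every item is a yes/no fact with a fixed point value.
# Item names and point values are made up for illustration.
OPEN_SOURCE_RUBRIC = {
    "merged_pr_in_1000_star_project": 15,
    "gsoc_participant": 10,
    "maintainer_of_any_public_project": 5,
    "answered_issues_or_reviewed_prs": 5,
}
MAX_OPEN_SOURCE_POINTS = 35  # cap from the original prompt's 0-35 range


def score_open_source(findings: dict[str, bool]) -> int:
    """Sum the points for every rubric item found to be true, capped at the max."""
    unknown = set(findings) - set(OPEN_SOURCE_RUBRIC)
    if unknown:
        raise ValueError(f"findings not in rubric: {sorted(unknown)}")
    total = sum(
        points
        for item, points in OPEN_SOURCE_RUBRIC.items()
        if findings.get(item, False)
    )
    return min(total, MAX_OPEN_SOURCE_POINTS)
```

With this, two candidates with the same findings always get the same score, and a reviewer can see exactly which item produced which points.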
3. The authors of this prompt have left an incredible number of judgement calls up to the LLM, when that's the very thing you want to minimize. Using the same example as above...
- Are all contributions to open source projects with 1000+ stars equal?
- What counts as a "significant contribution"? Doesn't that imply that the LLM has to know or read through all of the project's commits from at least the last ~6 months to understand what the given contribution meant to the project? That itself isn't impossible with tool usage, but again, that'd be a separate task.
- What on earth counts as "Substantial community involvement"? Why didn't the prompt authors define this, or at least give a few examples?
Honestly at this point maybe someone should build a tool that scans prompts for adjectives...
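A crude version of that tool is only a few lines. To be clear, this is a word-list lookup, not real adjective detection (that would need a part-of-speech tagger), and the list itself is just my guess at the usual offenders.

```python
import re

# Hand-picked, hypothetical list of judgement-laden words; extend to taste.
VAGUE_WORDS = {
    "significant", "substantial", "popular", "well-known",
    "strong", "good", "complex", "impressive",
}


def find_vague_words(prompt: str) -> list[tuple[int, str]]:
    """Return (line_number, word) for each vague word, in order of appearance."""
    hits = []
    for line_no, line in enumerate(prompt.splitlines(), start=1):
        # Keep hyphenated words like "well-known" together as one token.
        for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", line):
            if word.lower() in VAGUE_WORDS:
                hits.append((line_no, word.lower()))
    return hits
```

Run on the bullets quoted above, it flags "popular", "significant", "well-known", and "substantial", which is pretty much every word doing the actual scoring work.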
4. This sort of thing is just asking for trouble:
> SCORES MUST NEVER DEPEND ON:
> Candidate's name, gender, or personal demographic information
Just remove this stuff before you send the rest of the resume to the LLM. Even if you ask it not to, it's not a person, it's a very fancy statistical distribution generator. All of the input (including the name) will affect the distribution that gets generated. (This one is not unlike Andreessen's "don't be a sycophant" prompt.)
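Something like this before the LLM ever sees the resume. It assumes the resume has already been parsed into a dict, and the field names are hypothetical; a real parser will use different ones, and free-text scrubbing needs more than a name match (pronouns, photos, club memberships, and so on).

```python
import re

# Hypothetical field names for a parsed resume; real parsers will differ.
DEMOGRAPHIC_FIELDS = {
    "name", "email", "phone", "gender", "date_of_birth", "photo_url", "address",
}


def redact_resume(resume: dict, candidate_id: str) -> dict:
    """Return a copy without demographic fields and with the name scrubbed
    from top-level string fields. Nested values are left untouched."""
    redacted = {k: v for k, v in resume.items() if k not in DEMOGRAPHIC_FIELDS}
    redacted["candidate_id"] = candidate_id  # opaque ID to rejoin results later

    name = str(resume.get("name", "")).strip()
    if name:
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        redacted = {
            k: pattern.sub("[CANDIDATE]", v) if isinstance(v, str) else v
            for k, v in redacted.items()
        }
    return redacted
```

The scores get joined back to the real person by `candidate_id` after the LLM is done.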
5. Obviously this one depends on the LLM in question, but instead of writing things like:
> DO NOT RETURN A RESUME SUMMARY. RETURN ONLY THE SCORING EVALUATION IN THE SPECIFIED JSON FORMAT. Analyze the following resume and provide a JSON response with this EXACT structure (all fields are required):...
The system should utilize the "structured output" option, which guarantees a fixed output format. Also, fwiw, the JSON should force the LLM to pick between categorical options as much as possible. Forced-choice structured output should, at least in theory, cut down on hallucinatory responses and constrain judgement calls.
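Here's roughly what a forced-choice schema could look like. The field names and options are my own invention, and how you hand the schema to the model depends on the provider, so I've left that part out. The validator below is just a local belt-and-braces check for when the provider doesn't enforce the schema for you.

```python
import json

CHOICES = ["yes", "no", "cannot_determine"]

# Hypothetical forced-choice schema (JSON Schema style); field names are assumptions.
OPEN_SOURCE_SCHEMA = {
    "type": "object",
    "properties": {
        "merged_pr_in_1000_star_project": {"type": "string", "enum": CHOICES},
        "gsoc_participant": {"type": "string", "enum": CHOICES},
        "evidence_quote": {"type": "string"},
    },
    "required": ["merged_pr_in_1000_star_project", "gsoc_participant", "evidence_quote"],
    "additionalProperties": False,
}


def parse_evaluation(raw: str) -> dict:
    """Parse the model's reply and reject anything outside the allowed options."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    required = set(OPEN_SOURCE_SCHEMA["required"])
    if not isinstance(data, dict) or set(data) != required:
        raise ValueError("reply must contain exactly the required fields")
    for field, spec in OPEN_SOURCE_SCHEMA["properties"].items():
        value = data[field]
        if not isinstance(value, str):
            raise ValueError(f"{field} must be a string")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{field} has disallowed value {value!r}")
    return data
```

Note the explicit "cannot_determine" option: without an escape hatch, a forced choice just forces the model to guess.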
6. One major thing that's not in the prompt is anything about traceability. This system should be designed so that humans can review the logs and make sure this is working as intended.
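At minimum I'd want one log record per LLM call, along these lines. The record fields and the JSON-lines file are my assumptions about what a reviewer would need, not anything from the repo.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_trace_record(prompt_template: str, llm_input: str, model: str, raw_output: str) -> dict:
    """Bundle what a reviewer needs to audit one evaluation call."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        # Hash of the template so you can tell which prompt version produced a score.
        "prompt_sha256": hashlib.sha256(prompt_template.encode("utf-8")).hexdigest(),
        "input": llm_input,
        "raw_output": raw_output,
    }


def append_trace(path: str, record: dict) -> None:
    """Append one record as a JSON line so the log stays easy to grep."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```

Storing the raw output, not just the parsed score, is what lets a human check later whether the score actually follows from what the model said.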
7. Another thing that is missing in the file is what I'll call evidence of a theory of coding / coder quality. Most of the examples are designed to have the LLM assess proxies for code quality, not code quality itself. Surely both should be taken into account?
I'm not an expert at evaluating coders. But two pretty basic LLM-answerable things I would ask are: How well do a candidate's 5 most recent commit messages match the contents of those commits? Do the claimed technical skills on the resume match their GitHub code? (i.e., if they say they know R, is there any evidence of that on their GitHub?)
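The skills check doesn't even need an LLM for a first pass. A sketch, assuming you've already fetched a mapping of repo name to detected languages (the `repo_languages` shape is hypothetical, and I'm not showing the fetch):

```python
def unsupported_skills(claimed_skills: list[str], repo_languages: dict[str, list[str]]) -> list[str]:
    """Return the claimed skills that show up in none of the candidate's public repos."""
    seen = {
        lang.lower()
        for languages in repo_languages.values()
        for lang in languages
    }
    return [skill for skill in claimed_skills if skill.lower() not in seen]


# Example: the resume claims R, but no public repo contains any.
claimed = ["Python", "R", "Go"]
repos = {"scraper": ["Python", "Shell"], "cli-tool": ["go"]}
missing = unsupported_skills(claimed, repos)  # ["R"]
```

A miss here is a flag for a human to look at, not a penalty: plenty of good work lives in private repos.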
8. The prompt also seems unaware of what it's asking the LLM to do:
> LIVE DEMO BONUS: Projects with working live demos should receive 10-20% higher scores
This implies that the LLM can use tools, but even then, I'd be pretty wary of its ability to fully execute this part of the prompt without more detailed instructions, examples, and guidance. There are very likely tons of edge cases here.