mahmood726
10 hours ago
I’m an academic working on reliability for high-stakes LLM use (coding plus scientific/medical workflows). This repo proposes a “fail-closed” certification gate: an output ships only if it passes a set of published checks; otherwise it is rejected. The benchmark emphasis is on false-ship rate (the fraction of outputs that shipped but were wrong), not just accuracy.

Looking for critique and real failure cases: where do LLMs most often produce plausible outputs that are silently wrong (C#/.NET, SQL, Python notebooks, data extraction, etc.)? And which validation checks would you consider non-negotiable?
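To make the shape concrete, here’s a minimal sketch of the gate and the metric. The check functions and names here are illustrative, not the repo’s actual API; the key property is that any failing check, or any check that itself crashes, rejects the output rather than shipping it:

```python
from dataclasses import dataclass
from typing import Callable

# A check takes the model output and returns (passed, reason).
Check = Callable[[str], tuple[bool, str]]

@dataclass
class GateResult:
    shipped: bool
    failures: list[str]

def fail_closed_gate(output: str, checks: list[Check]) -> GateResult:
    """Ship only if every published check passes.

    Fail-closed: a check that raises is treated as a failure,
    never as a pass."""
    failures: list[str] = []
    for check in checks:
        try:
            passed, reason = check(output)
        except Exception as exc:  # a broken check must not fail open
            passed, reason = False, f"check raised {exc!r}"
        if not passed:
            failures.append(reason)
    return GateResult(shipped=not failures, failures=failures)

# Illustrative checks (hypothetical; real ones would be task-specific):
def parses_as_python(output: str) -> tuple[bool, str]:
    try:
        compile(output, "<llm-output>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"syntax error: {exc}"

def non_empty(output: str) -> tuple[bool, str]:
    return (bool(output.strip()), "empty output")

def false_ship_rate(records: list[tuple[bool, bool]]) -> float:
    """records: (shipped, correct) pairs over the benchmark.

    False-ship rate = fraction of all cases that shipped but were wrong;
    this is the metric the gate is meant to drive toward zero."""
    wrongly_shipped = sum(1 for shipped, correct in records
                          if shipped and not correct)
    return wrongly_shipped / len(records) if records else 0.0
```

The design question I’d most like pushback on is the check layer itself: a gate like this is only as strong as the weakest published check, so checks that merely confirm plausibility (parses, non-empty, type-checks) can still let silently wrong outputs through.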