mattchamb
a day ago
Possibly overly cynical, but I think a lot of people being forced to shoehorn LLMs into their applications don't have the option of NOT using an LLM for the required use case.
Testing and validation kind of imply that it won't ship if it isn't fit for purpose, but not shipping isn't an option, and most devs already know it's kind of junk.
That's not to say there aren't good use cases, but when you're forced to add an LLM somewhere it doesn't work, and you have no examples of "correct" output anyway, validation is usually an afterthought.
alexkirwan
11 hours ago
Not cynical at all. I think you're highlighting a real problem in the industry, and certainly something we've seen: teams, for a number of reasons (optics, marketing, hype/vibes, experimentation, pressure to adopt AI), use LLMs perhaps without proper consideration. That's actually the opposite of what we're advocating for.
The whole point of proper testing is to determine whether an LLM is suitable for your specific task, and then to keep testing and measuring to optimise for the outcome you want. The post is more about testing LLMs at scale, and the use cases we refer to assume a system design took place where the use of an LLM was deemed necessary for the task. Teams absolutely should have the option to determine that an LLM is not "fit for purpose". Reva actually helps with this - good testing and validation during the experimentation stage often reveals when a simpler solution works better. But "pressure" can come in many forms, and I have empathy for teams in environments where saying no isn't part of the culture.
We're working with teams that have real use cases, and we've seen a real problem with how teams are testing their use of LLMs. It's hard. Especially at scale! We built infrastructure that lets you test with your own real historic data, so you can measure actual performance improvements (or regressions) against the business outcome rather than "yep, looks good!"
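To make that concrete, here's a minimal sketch of what evaluating against historic data can look like. This is illustrative only, not Reva's API: `call_llm`, the `historic_cases.jsonl` format, and the exact-match metric are all assumptions you'd replace with your own.

```python
import json
from typing import Callable

def call_llm(prompt: str, variant: str) -> str:
    """Placeholder for your actual LLM call (API client, prompt template, model choice)."""
    raise NotImplementedError

def evaluate(variant: str, records: list[dict], score: Callable[[str, str], float]) -> float:
    """Run one prompt/model variant over historic records and return the mean score."""
    scores = []
    for rec in records:
        output = call_llm(rec["input"], variant)
        # Compare against the outcome the business actually accepted, not a vibe check.
        scores.append(score(output, rec["expected"]))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Historic data: real production inputs paired with their known-good outcomes.
    with open("historic_cases.jsonl") as f:
        records = [json.loads(line) for line in f]

    exact_match = lambda out, expected: float(out.strip() == expected.strip())

    baseline = evaluate("prompt_v1", records, exact_match)
    candidate = evaluate("prompt_v2", records, exact_match)

    print(f"baseline={baseline:.3f} candidate={candidate:.3f}")
    if candidate < baseline:
        print("Regression against historic data: do not ship.")
```

The point is simply that the comparison is made against real past cases with a defined metric, so a change either moves the number or it doesn't.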