songeater
2 days ago
Is gpt-3.5-turbo-instruct function calling a chess-playing model instead of generating through the base LLM?
This is not "cheating" in my opinion... in general better for LLMs to know when to call certain functions, etc.
air7
2 days ago
I don't know... It's like claiming that Samsung "enhanced their phone camera abilities" when they replaced zoomed-in moon shots with hi-res images of the moon.
https://www.samsungmobilepress.com/feature-stories/how-samsu...
delecti
2 days ago
I think that's meaningfully different. If you ask for chess advice, and get chess advice, then your request was fulfilled. If you ask for your photo to be optimized, and they give you a different photo, they haven't fulfilled your request. If GPT were giving Go moves instead of chess moves, or generating random moves, that might be a better comparison. The nature of the user's intent is just too different.
fsckboy
2 days ago
>it's like claiming that Samsung "enhanced their phone camera abilities" when they replaced zoomed-in moon shots with hi-res images of the moon
to be fair, the human visual system does the same
HarHarVeryFunny
2 days ago
It's cheating to the extent that it misrepresents the strength and reasoning ability of the model - to the extent that anyone looking at its chess-playing results would incorrectly infer that they say anything about how good the model is.
The takeaway here is that if you are evaluating different models for your own use case, the only indication of how useful each may be is to test it on your actual use case, and ignore all benchmarks or anything else you may have heard about it.
thrw42A8N
2 days ago
It represents the reasoning ability of the model to correctly choose and use a tool... Which seems more useful than a model that can do chess by itself but when you need it to do something else, it keeps playing chess.
vundercind
2 days ago
Where it'll surprise people is if they don't realize it's using an external tool and expect it to find solutions of similar complexity to non-chess problems, or if they don't realize this was probably a special case added to the program - it doesn't mean the model has learned how to find and use the right tool for a given problem in the general case.
I agree that this is a good way to enhance the utility of these things, though.
HarHarVeryFunny
2 days ago
It doesn't take much to recognize a sequence of chess moves. A regex could do that.
If what you want is intelligence and reasoning, there is no tool for that - LLMs are as good as it gets for now.
At the end of the day it either works on your use case, or it doesn't. Perhaps it doesn't work out of the box but you can code an agent using tools and duct tape.
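For illustration, a rough sketch of such a regex in Python (my own toy, not a full SAN parser):

    import re

    # One SAN token: castling, or optional piece letter, optional
    # disambiguation, optional capture, destination square, optional
    # promotion, optional check/mate marker.
    SAN = r"(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?"

    # A numbered move sequence like "1. e4 e5 2. Nf3 Nc6"
    GAME = re.compile(rf"(?:\d+\.\s*{SAN}(?:\s+{SAN})?\s*)+")

    print(bool(GAME.fullmatch("1. e4 e5 2. Nf3 Nc6 3. Bb5")))  # True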
thrw42A8N
2 days ago
Do you really think it's feasible to maintain and execute a set of regexes for every known problem every time you need to reason about something? Welcome to the 1970s AI winter...
HarHarVeryFunny
2 days ago
No I don't - I'm saying that tool use is no panacea, and availability of a chess tool isn't going to help if what YOU need is a smarter model.
thrw42A8N
2 days ago
Sure, but how do you train a smarter model that can use tools, without first having a less smart model that can use tools? This is just part of the progress. I don't think anyone claims this is the endgame.
HarHarVeryFunny
2 days ago
I really don't understand what point you are trying to make.
Your original comment about a model that might "keep playing chess" when you want it to do something else makes no sense. This isn't how LLMs work - they don't have a mind of their own, but rather just "go with the flow" and continue whatever prompt you give them.
Tool use is really no different than normal prompting. Tools are internally configured as part of the hidden system prompt. You're basically just telling the model to use a specific tool in specific circumstances, and the model will have been trained to follow instructions, so it does so. This is just the model generating the most expected continuation as normal.
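To make that concrete, here's roughly the shape of it through the OpenAI chat API - a minimal sketch, and the play_chess_move tool is hypothetical, not something OpenAI actually ships:

    from openai import OpenAI

    client = OpenAI()

    # The model never executes anything; it only sees this JSON schema
    # alongside the prompt and may emit a structured call matching it.
    tools = [{
        "type": "function",
        "function": {
            "name": "play_chess_move",  # hypothetical tool
            "description": "Return a strong move for a chess position.",
            "parameters": {
                "type": "object",
                "properties": {"pgn": {"type": "string"}},
                "required": ["pgn"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "1. e4 e5 2. Nf3 - your move?"}],
        tools=tools,
    )
    # The caller (not the model) runs the tool and feeds the result
    # back as a follow-up message.
    print(resp.choices[0].message.tool_calls)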
simonw
2 days ago
"Is gpt-3.5-turbo-instruct function calling a chess-playing model instead of generating through the base LLM?"
I'm absolutely certain it is not. gpt-3.5-turbo-instruct is one of OpenAI's least important models (by today's standards) - it exists purely to give people who built software on top of the older completion models something to port their code to (if it doesn't work with instruction-tuned models).
I would be stunned if OpenAI had any special-case mechanisms for that model that called out to other systems.
When they have custom mechanisms - like Code Interpreter mode - they tell you about them.
I think it's much more likely that something about instruction tuning / chat interferes with the model's ability to really benefit from its training data when it comes to chess moves.
HarHarVeryFunny
2 days ago
It should be easy to test for. An LLM playing chess itself tries to predict the most likely continuation of a partial game it is given, which includes (it has been shown) internally estimating the strength of the players to predict equally strong or weak moves.
If the LLM is just a pass-through to a chess engine, then it is more likely to play at the same strength all the time.
It's not clear in the linked article how many moves the LLM was given before being asked to continue, or whether these were all grandmaster games. If the LLM still crushes it when asked to continue a half-played, poor-quality game, that would be a good indication it's not an LLM making the moves (since a real LLM would match the poor quality of play).
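A sketch of that test using python-chess plus a local Stockfish binary (the weak opening, prompt format, and search depth here are my placeholders, not the article's setup):

    import chess
    import chess.engine
    from openai import OpenAI

    client = OpenAI()
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")

    weak_prefix = "1. a3 e5 2. h4 d5 3. Ra2 "  # deliberately poor play by White

    def board_from_pgn(prefix):
        board = chess.Board()
        for tok in prefix.split():
            if not tok[0].isdigit():  # skip move numbers like "2."
                board.push_san(tok)
        return board

    def model_move(prefix):
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=prefix, max_tokens=8, temperature=0)
        return resp.choices[0].text.split()[0]

    def centipawn_loss(board, san):
        limit = chess.engine.Limit(depth=12)
        before = engine.analyse(board, limit)["score"].relative.score(mate_score=10000)
        board.push_san(san)
        after = engine.analyse(board, limit)["score"].relative.score(mate_score=10000)
        return before + after  # ~0 for an engine-quality move

    board = board_from_pgn(weak_prefix)
    san = model_move(weak_prefix)
    print(san, centipawn_loss(board, san))
    engine.quit()

Consistently near-zero losses on junk games would point to a pass-through engine; losses that track the quality of the prefix would point to the LLM itself.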
jiggawatts
2 days ago
This is the point!
LLMs have this unique capability. Yet every AI company seems hell-bent on making them... not have that.
I want the essence of this unique aspect, but better, not this unique aspect diluted with other aspects such as the pure logical perfection of ordinary computer software. I already have that!
The problem with every extant AI company is that they're trying to make finished, integrated products instead of a component.
It's as-if you just wanted a database engine and every database vendor insisted on selling you a shopfront web app that also happens to include a database in there somewhere.
wjnc
2 days ago
If it read the CLI docs it might just make the right calls (x --ELO:1400).
kreyenborgi
2 days ago
If that's what it does, then it's "cheating" in the sense that people think they're interacting with an LLM, but they're actually interacting with an LLM + chess engine. This could give the impression that LLMs are able to generalize to a much broader extent than they actually are - while it's actually all just a special-purpose hack. A bit like putting invisible guard rails on some popular difficult test road for self-driving cars - it might lead you to think the car can drive that well on other difficult roads too.
msylvest
2 days ago
Calling out to some chess-playing function would be a deviation from the pure LLM paradigm. As a medium-level chess player I have walked through some of the LLM victories (gpt-3.5-turbo-instruct); I find it is not very good at winning by mate - it misses several chances at forced mate. But forced mate is exactly what chess engines are good at - it can be calculated by exhaustive search of the valid moves from a given board position.
So I'm arguing that it doesn't call out - it would have gotten better advice if it did.
But I remain amazed that OP does not report any illegal moves made by any of the LLMs. Assuming the training material includes introductory chess texts and a lot of games in textual notation (e.g. PGN), I would expect at least occasional illegal moves, since the rules are defined in terms of board positions - and a board position is a non-trivial function of the sequence of moves made in a game. Does an LLM silently perform a transformation of the move sequence into a board position? Can LLMs, during training, read and understand the board-position diagrams in chess books?
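For what it's worth, the bookkeeping involved is easy to see with python-chess (my illustration, not from the post) - the legal-move set depends on the whole position, which a pure LLM would have to track implicitly:

    import chess

    board = chess.Board()
    for san in ["e4", "c5", "Nf3", "d6", "d4", "cxd4"]:
        board.push_san(san)

    print(board.fen())  # the position the move list implies
    print(sorted(board.san(m) for m in board.legal_moves))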
gs17
a day ago
> But I remain amazed that OP does not report any illegal moves made by any of the LLMs.
They did (though without enough detail to know how much of an impact it had):
> For the open models I manually generated the set of legal moves and then used grammars to constrain the models, so they always generated legal moves. Since OpenAI is lame and doesn’t support full grammars, for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly.
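Presumably something along these lines: enumerate the legal moves with a chess library and compile them into a grammar (a sketch of the technique using llama.cpp-style GBNF, not the post's actual code):

    import chess

    def legal_move_grammar(board):
        # One quoted alternative per SAN string that is legal here.
        alts = " | ".join(f'"{board.san(m)}"' for m in board.legal_moves)
        return f"root ::= {alts}"

    board = chess.Board()
    board.push_san("e4")
    print(legal_move_grammar(board))
    # e.g. root ::= "Nh6" | "Nf6" | ... | "e5" | "e6" | ...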
gs17
2 days ago
I don't think it is, since OpenAI never mentions that anywhere AFAIK. That would be a really niche feature to include and then drop instead of building on more.
songeater
2 days ago
GPT-3.5 has had function-calling capability since July 2023 (for users). [1]
True, they have never mentioned that the 3.5 model already does this in the back end for certain features.
Anyone at OpenAI care to comment? Not a particularly controversial topic.
[1] https://openai.com/index/function-calling-and-other-api-upda...
PeterStuer
2 days ago
That was also my first thought. The discrepancy is just too large to be the mere result of a transformer model fed more chess data.
basman
a day ago
Based on looking at the games at the end of the post, it seems unlikely. Both sides play extremely poorly — gpt-instruct is just slightly less bad — and I don't see any reasonable engine outputting those moves.
jerf
2 days ago
That seems the most likely scenario to me.
Helping that along is that it's an obvious scenario to optimize, for all kinds of reasons. One of them is that it's a fairly good "middle of the road" test for integrating with such systems; not as trivial as "let's feed '1 + 1' to a calculator" and nowhere near as complicated as "let's simulate an entire web page and pretend to click on a thing".
throwaway314155
2 days ago
Why would they only incorporate a chess engine into (seemingly) exactly one very old, dated model? The author tests o1-mini and gpt-4o. They both fail at chess.
jerf
2 days ago
Because they decided it wasn't worth the effort. I can point to any number of similar situations over the many years I've been working on things. Bullet-point features that aren't pulling their weight or are no longer attracting the hype often don't get carried across upgrades.
A common myth that people have is that these companies have so much money they can do everything, and then they're mystified by things like bugs in Apple or Microsoft projects that survive for years. But from any given codebase, the space of "things we could do next" is exponential. That defeats any amount of money. If they're considering porting their bespoke chess engine code up to the next model, which absolutely requires non-trivial testing and may require non-trivial work, even for the richest companies in the world it is still an opportunity cost and they may not choose to spend their time there.
I'm not saying this is the situation for sure; I'm saying that this explanation is sufficient that I'm not going "oh my gosh this situation just isn't possible". It's definitely completely possible and believable.
kardos
2 days ago
If the goal is to produce an LLM-like interface that generates correct output, then sure, it's not cheating... but is it really a data-driven LLM at that point? If the LLM amounts to a chat front-end that calls a host of human-prepared programs or draws from human-prepared databases, it starts to sound a lot more like Wolfram Alpha v2 than an LLM, and strikes me as walking away from AGI rather than toward it.