insane_dreamer
6 hours ago
The problem is: eventually, what are LLMs going to draw from? They're not creating new information, just regurgitating and combining existing info. That's why they perform so poorly on code for which there aren't many publicly available samples, SO/Reddit answers, etc.
zmmmmm
an hour ago
It may be an interesting side effect that people stop so gratuitously inventing random new software languages and frameworks because the LLMs don't know about them. I know I'm already leaning towards tech that the LLM can work well with, simply because being able to ask the LLM to solve 90% of the problem outweighs any marginal advantage a slightly better language or framework offers. For example, I dislike Python as a language pretty intensely, but I can't deny that the LLMs are significantly better at Python than at many other languages.
A4ET8a8uTh0
39 minutes ago
Alternatively, esoteric languages and frameworks will become even more lucrative, simply because only the person who invented them and their hardcore following will understand half of them.
Obviously, not a given, but not unreasonable given what we have seen historically.
nfw2
an hour ago
Fwiw, GPT o1 helped me figure out a fairly complex use case of epub.js, an open-source library with pretty opaque documentation and relatively few public samples. It took a few back-and-forths to get to a working solution, but it did get there.
It makes me wonder if the AI successfully found and digested obscure sources on the internet or was just better at making sense of the esoteric documentation than me. If the latter, perhaps the need for public samples will diminish.
TaylorAlexander
an hour ago
Well, Gemini completely hallucinated command-line switches on a recent question I asked it about the program “John the Ripper”.
We absolutely need public sources of truth at the very least until we can build systems that actually reason based on a combination of first principles and experience, and even then we need sources of truth for experience.
You simply cannot create solutions to new problems if your data gets too old to encompass the new subject matter. We have no systems which can adequately determine fact from fiction, and new human experiences will always need to be documented for machines to understand them.
kachapopopow
an hour ago
Experienced the same thing with a library that has no documentation and takes advantage of C++23 (latest) features.
neither_color
5 hours ago
I find that it sloppily goes back and forth between old and new methods, and as your LLM spaghetti code grows it becomes incapable of precisely adding functions without breaking existing logic. All those tech demos of it instantly creating a whole app with one or a few prompts are junk. If you don't know what you're doing, then as you keep adding features it WILL constantly switch up the way you make API calls (here's a file with 3 native fetch functions, let's install and use axios for no reason), the way you handle state, change your CSS library, etc.
{/* rest of your functions here */} - DELETED
After a while it's only safe for doing tedious things like loops and switches.
So I guess our jobs are safe for a little while longer
emptiestplace
an hour ago
Naively asking it for code for anything remotely complex is foolish, but if you do know what you're doing and understand how to manage context, it's a ridiculously potent force multiplier. I rarely ask it for anything without specifying which libraries I want to use, and if I'm not sure which library I want, I'll ask it about options and review before proceeding.
n_ary
6 hours ago
LLMs show their limits when you ask about something new (introduced in the last 6-12 months) that isn't widely used yet. I was asking Claude and GPT-4o about a new feature of Go, and they just gave me some old stuff from the Go docs. Then I went to the official Go docs and found what I was looking for anyway; the feature was released two major versions back, but somehow neither GPT-4o nor Claude knew about it.
SunlitCat
6 hours ago
With GPT-4o I had some success pointing it at the current documentation of the projects I needed and having it give me current, accurate answers.
Like "Help me to do this and that and use this list of internet resources to answer my questions"
fullstackwife
an hour ago
The answer is already known, and it is a multi-billion-dollar business: https://news.ycombinator.com/item?id=41680116
stickfigure
6 hours ago
> The problem is: eventually, what are LLMs going to draw from?
Published documentation.
I'm going to make up a number but I'll defend it: 90% of the information content of stackoverflow is regurgitated from some manual somewhere. The problem is that the specific information you're looking for in the relevant documentation is often hard to find, and even when found is often hard to read. LLMs are fantastic at reading and understanding documentation.
Const-me
6 hours ago
That is only true for trivial questions.
I've answered dozens of questions on stackoverflow.com with tags like SIMD, SSE, AVX, NEON. Only a minority of these asked for a single SIMD instruction which does something specific. Usually people ask how to use the complete instruction set to accomplish something higher level.
Documentation alone doesn't answer questions like that, you need an expert who actually used that stuff.
irunmyownemail
6 hours ago
Published documentation has been and can be wrong. In the late 1990s and early 2000s, when I still did a mix of Microsoft technologies and Java, I found several bad, non-obvious errors in MSDN documentation. AI today would likely regurgitate them in a soft, mild, but authoritative-sounding way. At least when discussing with real people, after the arrows fly and the dust settles, we can figure out the truth.
Ferret7446
4 hours ago
Everything (and everyone, for that matter) can be and has been wrong. What matters is whether it is useful. And AI as it is now is pretty decent at finding ("regurgitating") information in large bodies of data much faster than humans, and with enough accuracy to be "good enough" for most uses.
Nothing will ever replace your own critical thinking and judgment.
> At least when discussing with real people after the arrows fly and the dust settles, we can figure out the truth.
You can actually do that with AI now. I have been able to correct AI many times via a Socratic approach (where I didn't know the correct answer, but I knew the answer the AI gave me was wrong).
roughly
6 hours ago
Yeah, this is wildly optimistic.
From personal experience, I'm skeptical of the quantity and especially the quality of published documentation available, the completeness of that documentation, the degree to which it both recognizes and covers all the relevant edge cases, etc. Even Apple, which used to be quite good at that kind of thing, increasingly just refers developers to its WWDC videos. I'm also skeptical of the ability of LLMs to ingest and properly synthesize that documentation - I'm willing to bet the answers from SO and Reddit are doing more heavy lifting in shaping the LLM's "answers" than you're hoping here.
There is nothing in my couple decades of programming or my experience with LLMs that suggests published documentation will be sufficient for an LLM to produce sufficient-quality output without human synthesis somewhere in the loop.
lossolo
an hour ago
Knowledge gained from experience that isn't included in documentation is also a significant part of SO. For example: "This library will not work with service Y because of X; they do not support feature Z, as I discovered when I tried to use it myself" - or other empirical evidence about the behavior of software that isn't documented.
elicksaur
6 hours ago
Following the article's conclusion further: humans would stop producing new documentation with new concepts.
jsemrau
2 hours ago
Data annotation will be a huge business going forward.
finolex1
6 hours ago
There is still publicly available code and documentation to draw from. As models get smarter and bootstrapped on top of older models, they should need less and less training data. In theory, just providing the grammar for a new programming language should be enough for a sufficiently smart LLM to answer problems in that language.
Unlike freeform writing tasks, coding also has a strong feedback loop (i.e. does the code compile, run successfully, and output a result?), which means it is probably easier to generate synthetic training data for models.
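A minimal sketch of that feedback loop (generate_candidate() here is a hypothetical stand-in for sampling code from a model, and the test case is made up):

    # Keep only candidates that run cleanly and produce the expected
    # output; only verified samples become synthetic training data.
    import subprocess
    import sys
    import tempfile

    def passes_check(code: str, test_input: str, expected: str) -> bool:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                input=test_input, capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0 and result.stdout.strip() == expected

    dataset = []
    for _ in range(1000):
        code = generate_candidate()  # hypothetical: sample from the model
        if passes_check(code, test_input="3 4", expected="7"):
            dataset.append(code)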
layer8
5 hours ago
> In theory, just providing the grammar for a new programming language should be enough for a sufficiently smart LLM to answer problems in that language.
I doubt it. Take a language like Rust or Haskell or even modern Java or Python. Without prolonged experience with the language, you have no idea how the various features interact in practice, what the best practices and typical pitfalls are, what common patterns and habits have been established by its practitioners, and so on. At best, the system would have to simulate building a number of nontrivial systems using the language in order to discover that knowledge, and in the end it would still be like someone locked in a room without knowledge of how the language is actually applied in the real world.
oblio
an hour ago
> sufficiently smart LLM
Cousin of the sufficiently smart compiler? :-p
mycall
6 hours ago
I thought synthetic data is part of what's training the new multimodal large models, e.g. AlphaGeometry, o1, etc.
y7
6 hours ago
Synthetic data can never contain more information than the statistical model from which it is derived: it is simply the evaluation of a non-deterministic function on the model parameters. And the model parameters are simply a function of the training data.
I don't see how you can "bootstrap a smarter model" based on synthetic data from a previous-gen model this way. You may as well just train your new model on the original training data.
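One way to make that precise is the data processing inequality (an informal sketch; D is the original training data, theta the trained parameters, S the synthetic samples):

    % D -> theta -> S forms a Markov chain, so synthetic samples carry
    % at most as much information about D as the parameters do:
    D \to \theta \to S \quad\Longrightarrow\quad I(D; S) \le I(D; \theta)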
antisthenes
6 hours ago
Synthetic data without some kind of external validation is garbage.
E.g. you can't just synthetically generate code; something or someone needs to run it and see if it performs the functions you actually asked of it.
You need to feed the LLM output into some kind of formal verification system, and only then add it back to the synthetic training dataset.
Here, for example - dumb recursive training causes model collapse:
jneagu
6 hours ago
Yeah, there was a reference in a paywalled article a year ago (https://www.theinformation.com/articles/openai-made-an-ai-br...): "Sutskever's breakthrough allowed OpenAI to overcome limitations on obtaining high-quality data to train new models, according to the person with knowledge, a major obstacle for developing next-generation models. The research involved using computer-generated, rather than real-world, data like text or images pulled from the internet to train new models."
I suspect most foundational models are now knowingly trained on at least some synthetic data.
epgui
6 hours ago
In a very real sense, that’s also how human brains work.
elicksaur
6 hours ago
This argument always conflates simple processes with complex ones. Humans can work with abstract concepts at a level LLMs currently can't, and don't seem likely to become capable of. "True" and "False" are the best examples.
epgui
6 hours ago
It doesn’t conflate anything though. It points to exactly that as a main difference (along with comparative functional neuroanatomy).
It's helpful to realize the ways in which we do work the same way as AI, because it gives us perspective on ourselves.
(I don’t follow regarding your true and false statement, and I don’t share your apparent pessimism about the fundamental limits of AI.)
empath75
6 hours ago
AI companies are already paying humans to produce new data to train on, and will continue to do that. There are also additional modalities - they've already added text, video, and audio, and there are probably more possible. Right now almost all the content being fed into these AIs is stuff that humans can sense and understand, but why does it have to limit itself to that? There are probably all kinds of data types it could train on that could give it more knowledge about the world.
Even limiting yourself to code generation, there are going to be a lot of software developers employed to write or generate code examples and documentation just for AIs to ingest.
I think eventually AIs will begin coding in programming languages that are designed for AI to understand and work with and not for people to understand.
imoverclocked
5 hours ago
> AI companies are already paying humans to produce new data to train on and will continue to do that.
The sheer difference in scale between "here are all the people in the world who have shared data publicly until now" and "here is the relatively tiny population of people being paid to add new information to an LLM" dooms the LLM to become outdated in an information-hoarding society. So the question in my mind is: "Why will people keep producing public information just for it to be devalued into LLMs?"
manmal
42 minutes ago
How would a custom language differ from what we have now?
If you mean obfuscation, then yeah, maybe that makes sense to fit more into the window. But it’s easy to unobfuscate, usually.
Otherwise, I'm not sure what the goal of an LLM-specific language could be. I don't feel most languages have been made purely to accommodate humans anyway; they balance a lot of factors, like being true to the metal (like C), functional purity (Haskell), or fault tolerance (Erlang). I'm not sure what "being for LLMs" could look like.
jneagu
6 hours ago
Edit: OP had actually qualified their statement to refer to only underrepresented coding languages. That's 100% true - LLM coding performance is super biased in favor of well-represented languages, esp. in public repos.
Interesting - I actually think they perform quite well on code, considering that code has a set of correct answers (unlike most other tasks we use LLMs for on a daily basis). GitHub Copilot had a 30%+ acceptance rate (https://github.blog/news-insights/research/research-quantify...). How often does one accept the first answer that ChatGPT returns?
To answer your first question: new content is still being created in an LLM-assisted way, and a lot of it can be quite good. The rate of that happening is a lot lower than that of LLM-generated spam - this is the concerning part.
generic92034
6 hours ago
The OP qualified "code" as code with poor availability of samples online. My experience with LLMs on a proprietary language with little online presence confirms their statement. In many cases it is not even worth trying.
jneagu
6 hours ago
Fair point - I actually had parsed OP's sentence differently. I'll edit my comment.
I agree, LLMs' performance on coding tasks is super biased in favor of well-represented languages. I think this is what GitHub is trying to solve with custom private models for Copilot, but I expect that to be enterprise-only.