lordnacho
20 hours ago
My pet theory is similar to the training set hypothesis: em-dashes appear often in prestige publications like The Atlantic, The New Yorker, The Economist, and a few others considered good writing. Being magazines, they publish a lot of articles over time, reinforcing the style. They're also the sort of thing an RLHF rater will think is good, not because of the em-dash but because the general style is polished.
One thing I wondered is whether high-prestige writing is explicitly encoded into the models, but it doesn't seem far-fetched that there are various linkages inside the data that say "this kind of thing should be weighted highly."
kubb
20 hours ago
It also seems that LLMs are using them correctly — as a pause or replacement for a comma (yes, I know this is an imprecise description of when to use them).
Thanks to LLMs, I learned that using the short binding dash everywhere is incorrect, and I've been able to improve my writing because of it.
number6
13 hours ago
Before the rise of LLMs, there was a post here on HN where someone explained how to use all the dashes; sadly, LLMs took them from us.
cornonthecobra
17 hours ago
This is my theory as well, with the addition of books. If someone wanted to train a bot to sound more human, they would select data that is verifiably human-made.
The approachable tone of popular print media also preselects for the casual, highly readable style I suspect users would want from a bot.
tim333
16 hours ago
That kind of fits with Altman saying they put them in because users liked them (https://www.linkedin.com/posts/curtwoodward_chatgpt-em-dash-...).
I guess in the past, if you'd shown me a passage with em-dashes, I'd have said it looked good because I associate them with The New Yorker and The Economist, both of which I read. Now I'd be a bit more meh due to LLMs.