stavros
8 days ago
I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Qwen3 with max thinking got it even more wrong than the rest, for what it's worth.
mrkeen
8 days ago
As they say, we shouldn't judge AI by the current state-of-the-art, but by how far and fast it's progressing. I can't wait to see future models get it even more wrong than that.
kaoD
8 days ago
Personally (anecdata) I haven't experienced any practical progress in my day-to-day tasks for a long time, no matter how good they became at gaming the benchmarks.
They keep being impressive at what they're good at (aggregating sources to solve a very well known problem) and terrible at what they're bad at (actually thinking through novel problems or old problems with few sources).
E.g. ChatGPT, Claude and Gemini were all absolutely terrible at generating Liquidsoap[0] scripts. It's not even that complex, but there's very little information to ingest about the problem space, so you can actually tell they are not "thinking".
jim180
8 days ago
Absolutely. All models are terrible with Objective-C and Swift, compared to, let's say, JS/HTML/Python.
However, I've realized that Claude Code is extremely useful for generating somewhat simple landing pages for some of my projects. It spits out static html+js which is easy to host, with somewhat good looking design.
The code isn't the best and to some extent isn't maintainable by a human at all, but it gets the job done.
copperroof
8 days ago
I’ve gotten 0 production usable python out of any LLM. Small script to do something trivial, sure. Anything I’m going to have to maintain or debug in the future, not even close. I think there is a _lot_ of terrible python code out there training LLMs, so being a more popular language is not helpful. This era is making transparent how low standards really are.
overfeed
8 days ago
> I’ve gotten 0 production usable python out of any LLM
Fascinating, I wonder how you use it, because once I decompose code into modules and function signatures, Claude[0] is pretty good at implementing Python functions. I'd say it one-shots 60% of the time, I have to tweak the prompt or adjust the proposed diffs 30% of the time, and the remaining 10% is unusable code that I end up writing by hand. Other things Claude is even better at: writing tests, simple refactors within a module, authoring first-draft docstrings, adding context-appropriate type hints.
0. Local LLMs like Gemma3, Qwen-coder seem to be in the same ballpark in terms of capabilities, it's just that they are much slower on my hardware. Except for the 30b Qwen3 MoE that was released a day ago, that one is freakin' fast.
cmorgan31
8 days ago
I agree - you have to treat them like juniors and provide the same context you would someone who is still learning. You can’t assume it’s correct but where it doesn’t matter it is a productivity improvement. The vast majority of the code I write doesn’t even go into production so it’s fantastic for my usage.
the13
a day ago
What happens to the vast majority of code you write
thelittleone
8 days ago
Different experience here. Production code in banking and finance for backend data analysis and reporting. Sure the code isn't perfect, but doesn't need to be. It's saving >50% effort and the analysis results and reporting are of at least as good a standard as human developed alternatives.
startupsfail
8 days ago
Try o4-mini-high. It’s getting there.
motbus3
8 days ago
Maybe with the next gpt version, gpt-4.003741
apercu
8 days ago
Interesting, I'll have to try that. All the "static" page generators I've tried require React....
jimvdv
8 days ago
I like using Vercel v0 for frontend
ggregoryarms
8 days ago
Building a basic static html landing page is ridiculously easy though. What js is even needed? If it's just an html file and maybe a stylesheet of course it's easy to host. You can apply 20 lines of css and have a decent looking page.
These aren't hard problems.
snoman
8 days ago
A big part of my job is building proofs of concept for some technologies and that usually means some webpage to visualize that the underlying tech is working as expected. It’s not hard, doesn’t have to look good at all, and will never be maintained. I throw it away a few weeks later.
It used to take me an hour or two to get it all done up properly. Now it's literal seconds. It's a handy tool.
sheepscreek
8 days ago
> These aren’t hard problems.
Honestly, that’s the best use-case for AI currently. Simple but laborious problems.
jim180
8 days ago
Laziness mostly - no need to think about design, icons and layout (responsiveness and all that stuff).
These are not hard problems obviously, but getting to 80%-90% is faster than doing it by hand and in my cases that was more than enough.
With that being said, AI failed at the remaining 10%-20%, with various small visual issues.
apercu
8 days ago
> These aren't hard problems.
So why do so many LLMs fail at them?
bboygravity
8 days ago
And humans also.
prox
8 days ago
Absolutely. As soon as they hit the mark where things get really specialized, they start failing a lot. They're pretty good at generalizing over well-documented areas. I only use them for getting a second opinion, since they can search through a lot of documents quickly and find me alternatives.
Filligree
8 days ago
They have broad knowledge, a lot of it, and they work fast. That should be a useful combination.
And indeed it is. Essentially every time I buy something these days, I use Deep Research (Gemini 2.5) to first make a shortlist of options. It’s great at that, and often it also points out issues I wouldn’t have thought about.
Leave the final decisions to a super slow / smart intelligence (a human), by all means, but for people who claim that LLMs are useless I can only conclude that they haven’t tried very hard.
darepublic
8 days ago
Yes, similar experience querying GPT about lesser-known frameworks. Had o1 stone-cold hallucinate some non-existent methods I could find no trace of from googling. Would not budge on the matter either. Basically you have to provide the key insight yourself in these cases to get it unstuck, or just figure it out yourself. After it's dug into a problem to some degree, you get a feel for whether continued prompting on the subject is going to be helpful or just more churn.
krosaen
8 days ago
I'm curious what kind of prompting or context you are providing before asking for a Liquidsoap script - or if you've tried using Cursor and providing a bunch of context with documentation about Liquidsoap as part of it. My guess is these kinds of things get the models to perform much better. I have seen this work with internal APIs / best practices / patterns.
kaoD
8 days ago
Yes, I used Cursor and tried providing both the whole Liquidsoap book or the URL to the online reference just in case the book was too large for context or it was triggering some sort of RAG.
Not successful.
It's not that it didn't do what I wanted: most of the time it didn't even run. Iterating on the error messages just arrived at progressively dumber not-solutions and running in circles.
krosaen
8 days ago
Oh man, that's disappointing.
senordevnyc
8 days ago
What model?
kaoD
8 days ago
I'm on Pro two-week trial so I tried a mix of mainstream premium models (including reasoning ones) + letting Cursor route me to the "best" model or whatever they call it.
matrixhack
7 days ago
[flagged]
jang07
8 days ago
this problem is always going to exist in these models; they are hungry for good data
if there is focus on improving a model on something, the method to do it is known; it's just about priority
42lux
8 days ago
Haven’t seen much progress in base models since gpt4. Deep thinking and whatever else came in the last year are just bandaids hiding the shortcomings of said models, and were achievable before with the right tooling. The tooling got better; the models themselves are just marginally better.
kenjackson
8 days ago
You really had me until the last half of the last sentence.
stavros
8 days ago
The plural of anecdote is data.
tankenmate
8 days ago
"The plural of anecdote is data.", this is right up there with "1 + 1 = 3, for sufficiently large values of 1".
Had an outright genuine guffaw at this one, bravo.
rtaylorgarlock
8 days ago
Only in the same way that the plural of 'opinion' is 'fact' ;)
stavros
8 days ago
Except, very literally, data is a collection of single points (i.e. what we call "anecdotes").
cess11
8 days ago
No. Anecdote, ἀνέκδοτον, is a story that points to some abstract idea, commonly having something to do with morals. The word means 'not given out'/'not-out-given'. Data is the plural of datum, and arrives in English not from Greek but from Latin. The root is however the same as in anecdote, and datum means 'given'. Saying that 'not-given' and 'collection of givens' are the same is clearly nonsensical.
A datum has a value and a context in which it was 'given'. What you mean by "points" eludes me, maybe you could elaborate.
rwj
8 days ago
Except that the plural of anecdotes is definitely not data, because without controlling for confounding variables and sampling biases, you will get garbage.
fhd2
8 days ago
Based on my limited understanding of analytics, the data set can be full of biases and anomalies, as long as you find a way to account for them in the analysis, no?
LegionMammal978
8 days ago
The accuracy of your analysis becomes limited to the accuracy of how well you correct for the biases. And it's difficult to measure the bias accurately without lots of good data or cross-examination.
scubbo
8 days ago
Garbage data is still data, and data (garbage or not) is still more valuable than a single anecdote. Insights can only be distilled from data, by first applying those controls you mentioned.
jimmySixDOF
8 days ago
Or you can apply the Bezos/Amazon anecdote about anecdotes:
At a managers' meeting there were "user stories" about poor support, but all the KPIs from the call center looked good, so Jeff dials the number from the meeting speakerphone, gets put on hold, IVR spin cycle, hold again, etc. His takeaway was basically: if the data and the anecdotes don't match, always default to the customer stories.
bcoates
8 days ago
No. Wittgenstein's rule-following paradox, the Shannon sampling theorem, the fact that infinitely many polynomials pass through any finite set of points (does that have a name?), etc., etc., are all equivalent at the limit to the idea that no amount of anecdotes-per-se adds up to anything other than coincidence
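The polynomial point is easy to demonstrate with a made-up three-point sample (the points and polynomials below are hypothetical, just for illustration): adding any term that vanishes at every sampled x gives a second polynomial that agrees on all the data and disagrees everywhere else.

```python
# Two different polynomials that agree exactly on a finite sample
pts = [(0, 1), (1, 2), (2, 5)]  # sampled from y = x**2 + 1

def p(x):
    return x**2 + 1

def q(x):
    # Add a term that vanishes at x = 0, 1, 2, changing nothing on the data
    return x**2 + 1 + 3 * x * (x - 1) * (x - 2)

# Both fit the sample perfectly...
assert all(p(x) == y and q(x) == y for x, y in pts)
# ...but diverge off it: p(3) = 10, q(3) = 28
print(p(3), q(3))
```

Without a structural assumption (e.g. a bound on the degree), the sample alone cannot tell you which polynomial generated it.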
whatnow37373
8 days ago
Without structural assumptions, there is no necessity - only observed regularity. Necessity literally does not exist. You will never find it anywhere.
Hume figured this out quite a while ago and Kant had an interesting response to it. Think the lack of “necessity” is a problem? Try to find “time” or “space” in the data.
Data by itself is useless. It’s interesting to see peoples’ reaction to this.
bijant
8 days ago
@whatnow37373 — Three sentences and you’ve done what a semester with Kritik der reinen Vernunft couldn’t: made the Hume-vs-Kant standoff obvious. The idea that “necessity” is just the exhaust of our structural assumptions (and that data, naked, can’t even locate time or space) finally snapped into focus.
This is exactly the kind of epistemic lens-polishing that keeps me reloading HN.
tankenmate
8 days ago
This thread has given me the best philosophical chuckle I've had this year. Even after years of being here, HN can still put an unexpected smile on your face.
Der_Einzige
8 days ago
Anti-realism, indeterminacy, intuitionism, and radical subjectivity are extremely unpopular opinions here. Folks here are too dense to imagine that the cogito is fake bullshit and wrong. You're fighting an extremely uphill battle.
Paul Feyerabend is spinning in his grave.
inimino
8 days ago
No, no, no. Each of them gives you information.
bcoates
8 days ago
In the formal, information-theory sense, they literally don't, at least not on their own without further constraints (like band-limiting or bounded polynomial degree or the like)
acchow
8 days ago
“Plural of anecdote is data” is meant to be tongue-in-cheek.
Actual data is sampled randomly. Anecdotes very much are not.
9rx
7 days ago
> “Plural of anecdote is data” is meant to be tongue-in-cheek.
To be fair, possibly because of this tongue-in-cheek statement, anecdote is sometimes used in place of datum and not as anecdote is usually defined.
9rx
8 days ago
Technically we call it a datum. An anecdote is a story, not a point.
But it is true that colloquially anecdote is sometimes used in place of datum.
absolutelastone
8 days ago
One point is a collection of size 1. It is always data.
dymk
8 days ago
wizardforhire
8 days ago
Ahhhahhahahaha stavros is so right, but this is such high-level bickering I haven't laughed so hard in a long time. Y'all are awesome! dymk, you deserve a touché for this one.
The challenge with sharing data at this stage of the game is that the game is rigged in data's favor. So stavros, I hear you.
To clarify: if we post our data, it's just going to get fed back into the models, making it even harder to vet iterations as they advance.
dataf3l
8 days ago
I think somebody said it may be 'anecdata'
WhitneyLand
8 days ago
The plural of reliable data is not anecdote.
tomrod
8 days ago
Depends on the data generating process.
WhitneyLand
8 days ago
Of course, but then you have a system of gathering information with some rigor which is more than merely a collection of anecdotes. That becomes the difference.
windowshopping
8 days ago
"For what it's worth"? What's wrong with that?
Jordan-117
8 days ago
That's the last third of the sentence.
nopinsight
8 days ago
Current models are quite far away from human-level physical reasoning (paper below). An upcoming version of models trained on world simulation will probably do much better.
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
horhay
8 days ago
This is more about a physics math aptitude test. You can already see that the best model in math is saturating it halfway. It might not indicate its usefulness in actual physical reasoning, or at the very least, it seems like a bit of a stretch.
furyofantares
8 days ago
I similarly have a small, simple spatial reasoning problem that only reasoning models get right, and not all of them, and which Qwen3 on max reasoning still gets wrong.
> I put a coin in a cup and slam it upside-down on a glass table. I can't see the coin because the cup is over it. I slide a mirror under the table and see heads. What will I see if I take the cup (and the mirror) away?
UltraSane
8 days ago
Sonnet 3.7 non-reasoning got it right.
I'll think this through step by step. When you place a coin in a cup and slam it upside-down on a glass table, the coin will be between the table surface and the cup. When you look at the reflection in the mirror beneath the table, you're seeing the bottom side of the coin through the glass. Since the mirror shows heads, you're seeing the heads side of the coin reflected in the mirror. In a mirror, the image is reversed left-to-right, but the "heads" and "tails" sides of a coin would not be flipped in this reflection - you'd still recognize heads as heads. If you take away both the cup and the mirror, you'll see the opposite side of the coin compared to what you saw in the mirror. Since the mirror showed heads, you'll see tails when looking directly at the coin from above the table.
natrys
8 days ago
My first try (omitting chain of thought for brevity):
When you remove the cup and the mirror, you will see tails. Here's the breakdown:
Setup: The coin is inside an upside-down cup on a glass table. The cup blocks direct view of the coin from above and below (assuming the cup's base is opaque).
Mirror Observation: A mirror is slid under the glass table, reflecting the underside of the coin (the side touching the table). If the mirror shows heads, this means the underside of the coin is heads.
Coin Orientation: A coin has two sides: heads and tails. If the underside (facing the table) is heads, the top side (facing away from the table) must be tails.
Result: Removing the cup reveals the coin on the table. Since the top side was tails all along, you will see tails.
Key Insight: The mirror reveals the hidden underside of the coin, but the top side (visible after removing the cup) is its opposite. The mirror’s reflection does not alter the physical orientation of the coin’s sides—it simply reflects the existing state.
Answer: Tails.
SamBam
8 days ago
Huh, for me it said:
Answer: You will see the same side of the coin that you saw in the mirror — heads.
Why?
The glass table is transparent, so when you look at the coin from below (using a mirror), you're seeing the top side of the coin (the side currently facing up). Mirrors reverse front-to-back, not left-to-right. So the image is flipped in depth, but the orientation of the coin (heads or tails) remains clear. Since the coin hasn't moved during this process, removing the cup and mirror will reveal the exact same face of the coin that was visible via the mirror — which was heads.
Final Answer: You will see heads.
viraptor
8 days ago
The question doesn't define which side you're going to look from at the end, so either looking down or up is valid.
vunderba
8 days ago
While technically true, one could argue that because they described sliding a mirror underneath the table, they did this in order to see from that particular angle, so the implication for any reasonable answer would be that they are looking down at the table.
viraptor
7 days ago
The way the question is phrased, I'm definitely in the "ok, what's the trick, what do you want to hear" mode. I'm not surprised LLMs answer in a weird way with extra details like "this is how mirrors work, is this what you're confused about?"
Rooki
8 days ago
Manus just answered this correctly for me in one crack:
Me: I put a coin in a cup and slam it upside-down on a glass table. I can't see the coin because the cup is over it. I slide a mirror under the table and see heads. What will I see if I take the cup (and the mirror) away?
Manus: This is a classic riddle! Let's think it through. You placed a coin in a cup, turned it upside down on a glass table, and then slid a mirror underneath the table. Looking into the mirror from below, you saw 'heads'. Since the mirror reflects the side of the coin facing downwards through the glass table, the 'heads' side must be facing down. Therefore, the side facing up, hidden inside the cup, must be 'tails'. When you remove the cup, you will see the side facing up, which is tails.
staticman2
8 days ago
Except it's wrong about this being a classic riddle.
SamBam
8 days ago
Yup, it flunked that one.
I also have a question that LLMs always got wrong until ChatGPT o3, and even then it has a hard time (I just tried it again and it needed to run code to work it out). Qwen3 failed, and every time I asked it to look again at its solution it would notice the error and try to solve it again, failing again:
> A man wants to cross a river, and he has a cabbage, a goat, a wolf and a lion. If he leaves the goat alone with the cabbage, the goat will eat it. If he leaves the wolf with the goat, the wolf will eat it. And if he leaves the lion with either the wolf or the goat, the lion will eat them. How can he cross the river?
I gave it a ton of opportunities to notice that the puzzle is unsolvable (with the assumption, which it makes, that this is a standard one-passenger puzzle, but if it had pointed out that I didn't say that I would also have been happy). I kept trying to get it to notice that it failed again and again in the same way and asking it to step back and think about the big picture, and each time it would confidently start again trying to solve it. Eventually I ran out of free messages.
novaRom
8 days ago
4o with thinking:
By systematic (BFS) search of the entire 32-state space under these rules, one finds no path from the start state to the goal state that stays always safe. Thus the puzzle has no solution—there is no way for the man to ferry all four items across without at least one of them being eaten.
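That claimed search is easy to reproduce. A minimal sketch (assuming the standard one-passenger boat; the state encoding is my own, not 4o's): 2 man positions × 2^4 item placements gives the 32 states.

```python
from collections import deque
from itertools import chain

ITEMS = frozenset({"cabbage", "goat", "wolf", "lion"})
# Pairs that must never be left together without the man
UNSAFE = [{"goat", "cabbage"}, {"wolf", "goat"}, {"lion", "wolf"}, {"lion", "goat"}]

def bank_ok(bank, man_present):
    return man_present or not any(pair <= bank for pair in UNSAFE)

def solve():
    # A state is (man_on_left, frozenset of items on the left bank): 2 * 2**4 = 32 states
    start, goal = (True, ITEMS), (False, frozenset())
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (man_left, left), path = queue.popleft()
        if (man_left, left) == goal:
            return path
        here = left if man_left else ITEMS - left
        for cargo in chain([None], here):  # cross alone, or with one item from this bank
            new_left = left if cargo is None else (left - {cargo} if man_left else left | {cargo})
            state = (not man_left, new_left)
            # After crossing, the man is on the other side; both banks must stay safe
            if state not in seen and bank_ok(new_left, not man_left) and bank_ok(ITEMS - new_left, man_left):
                seen.add(state)
                queue.append((state, path + [cargo]))
    return None  # every safe reachable state explored: no solution

print(solve())  # → None
```

Every first trip already fails: whatever the man takes, an unsafe pair remains on the starting bank, so the search dead-ends immediately.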
mavamaarten
8 days ago
You go with the cabbage, goat, wolf and lion all together!
SamBam
8 days ago
O3 gave me basically that solution. "Below is the shortest safe schedule that really works ‒ but it assumes the boat can hold the man plus two passengers (three beings total). If your version of the puzzle only lets him move one passenger at a time, the puzzle has no solution: at the very first trip he would always leave at least one forbidden pair alone."
cyprx
8 days ago
i tried grok 3 with Think and it was right also with pretty good thinking
SamBam
8 days ago
I don't have access to Think, but I tried Grok 3 regular, and it was hilarious, one of the longest answers I've ever seen.
Just giving the headings, without any of the long text between each one where it realizes it doesn't work, I get:
Solution
[... paragraphs of text omitted each time]
Issue and Revision
Revised Solution
Final Solution
Correct Sequence
Final Working Solution
Corrected Final Solution
Final Correct Solution
Successful Solution
Final answer
Correct Final Sequence
Final Correct Solution
Correct Solution
Final Working Solution
Correct Solution
Final Answer
Final Answer
Each time it's so confident that it's worked out the issue, and now, finally, it has the correct, final, working solution. Then it blows it again. I'm surprised I didn't start seeing heading titles such as "Working solution-FINAL (3) revised updated ACTUAL-FINAL (2)"
vunderba
8 days ago
The only thing I don't like about this test is that I prefer test questions that don't have binary responses (e.g. heads or tails) - you can see from the responses that you got from the thread that the LLMs success rates are all over the map.
furyofantares
8 days ago
Yeah, same.
I had a more complicated prompt that failed much more reliably - instead of a mirror I had another person looking from below. But it had some issues where Claude would often want to refuse on ethical grounds, like I'm working out how to scam people or something, and many reasoning models would yammer on about whether or not the other person was lying to me. So I simplified to this.
I'd love another simple spatial reasoning problem that's very easy for humans but LLMs struggle with, which does NOT have a binary output.
Lucasoato
8 days ago
I tried it with the thinking option on and it ran into some networking errors; without thinking, it guesses the answer correctly.
> Summary:
- Mirror shows: *Heads* → that's the *bottom face* of the coin.
- So the actual top face (visible when cup is removed): *Tails*
Final answer: *You will see tails.*
tamat
8 days ago
I always feel that if you share a problem here where LLMs fail, it will end up in their training set and they won't fail on that problem anymore, which means future models will have the same errors but you'll have lost your ability to detect them.
senordevnyc
8 days ago
My favorite part of the genre of “questions an LLM still can’t answer because they’re useless!” is all the people sharing results from different LLMs where they clearly answer the question correctly.
furyofantares
8 days ago
I use LLMs extensively and probably should not be bundled into that genre as I've never called LLMs useless.
yencabulator
7 days ago
I think it's pretty random. qwen3:4b got it correct once, on re-run it told me the coin is actually behind the mirror, and then did this brilliant maneuver:
- The question is **not** asking for the location of the coin, but its **identity**.
- The coin is simply a **coin**, and the trick is in the riddle's wording.
---
### Final Answer:
$$
\boxed{coin}
$$
artemisart
8 days ago
ChatGPT free gets it right without reasoning mode (still explained some steps) https://chatgpt.com/share/6810bc66-5e78-8001-b984-e4f71ee423...
hmottestad
8 days ago
Tried it with o1-pro:
> You’ll find that the actual face of the coin under the cup is tails. Seeing “heads” in the mirror from underneath indicates that, on top, the coin is really tails‑up.
viraptor
8 days ago
Simple Claude 3.5 with no reasoning gets it right.
furyofantares
8 days ago
Neat, because 3.7 doesn't.
viraptor
8 days ago
3.7 works for me too. Of course it's going to depend on temperature a bit.
animal531
8 days ago
They're all using these tests to determine their worth, but to be honest the tests don't translate well to the real world.
For example, I tried Deepseek for code daily over a period of about two months (vs. having used ChatGPT before), and its output was terrible. It would produce code with bugs, break existing code when making additions, totally fail at understanding what you're asking, etc.
ggregoryarms
8 days ago
Exactly. If I'm going to be solving bugs, I'd rather they be my own.
spaceman_2020
8 days ago
I don’t know about physics, but o3 was able to analyze a floor plan and spot ventilation and circulation issues that even my architect brother wasn’t able to spot in a single glance
Maybe it doesn’t make physicists redundant, but it’s definitely making expertise in more mundane domains way more accessible
throwaway743
8 days ago
My favorite test is "Build an MTG Arena Deck in historic format around <strategy_and_or_cards> in <these_colors>. It must be exactly 60 cards and all cards must be from Arena only. Search all sets/cards currently available on Arena, new and old".
Many times they’ll include cards that are only available in paper and/or go over the limit, and when asked to correct a mistake they'll continue to make mistakes. But recently I found that Claude is pretty damn good now at fixing its mistakes and building/optimizing decks for Arena. Asked it to make a deck based on insights it gained from my current decklist, and what it came up with was interesting and pretty fun to play.
baxtr
8 days ago
This reads like a great story with a tragic ending!
concrete_head
8 days ago
Can you please share the problem?
stavros
8 days ago
I don't really want it added to the training set, but eh. Here you go:
> Assume I have a 3D printer that's currently printing, and I pause the print. What expends more energy, keeping the hotend at some temperature above room temperature and heating it up the rest of the way when I want to use it, or turning it completely off and then heat it all the way when I need it? Is there an amount of time beyond which the answer varies?
All LLMs I've tried get it wrong because they assume that the hotend cools immediately when stopping the heating, but realize this when asked about it. Qwen didn't realize it, and gave the answer that 30 minutes of heating the hotend is better than turning it off and back on when needed.
bufferoverflow
8 days ago
What kind of answer do you expect? It all depends on the hotend shape and material, temperature differences, how fast air moves in the room, humidity of the air, etc.
dtech
8 days ago
Keeping something above room temperature will always use more energy than letting it cool down and heating it back up when needed
sampo
8 days ago
> It all depends on
No it doesn't.
claiir
8 days ago
Sounds like the LLM you used when writing this slop comment struggled with the problem too. :>
oscord
8 days ago
Qwen3-32b did it pretty accurately it seems. Calculated heat loss over time going to ambient temp, offered to keep it at standby 100C for short breaks under 10 minutes. Shut down completely for longer breaks.
stavros
8 days ago
The correct answer is that it's always better to turn it off, though.
Filligree
8 days ago
Unless you care about warmup time. LLMs have a habit of throwing in common-sense assumptions that you didn’t tell it to, so you have to be careful of that.
It’s not a bug. Outside of logic puzzles that’s a very good thing.
stavros
8 days ago
No, warmup time doesn't change anything, I can simply factor it in.
It is a bug, because I asked it precisely what I wanted, and it gave the wrong answer. It didn't say anything about warmup time, it was just wrong.
andrewmcwatters
8 days ago
Ah! This problem was given to me by my father-in-law in the form of operating pizza ovens in the Midwest during winter. It's a neat, practical one.
pylotlight
8 days ago
Some calculation around heat loss and required heat expenditure to reheat per material or something?
stavros
8 days ago
Yep, except they calculate heat loss and the energy required to keep heating in one case, but assume room temperature and the energy required to heat up from that in the other, so they wildly overestimate one side of the problem.
bcoates
8 days ago
Unless I'm missing something holding it hot is pure waste.
markisus
8 days ago
Maybe it will help to have a fluid analogy. You have a leaky bucket. What wastes more water, letting all the water leak out and then refilling it from scratch, or keeping it topped up? The answer depends on how bad the leak is vs how long you are required to maintain the bucket level. At least that’s how I interpret this puzzle.
Torkel
8 days ago
Does it depend though?
The water (heat) leaking out is what you need to add back. As the water level drops (hotend cools), the leaking slows. So any replenishing means more leakage, which you eventually pay for by adding more water (heat).
markisus
8 days ago
You can stipulate conditions to make the solution work out in either direction.
Suppose the bucket is the size of lake, and the leak is so miniscule that it takes many centuries to detect any loss. And also I need to keep the bucket full for a microsecond. In this case it is better to keep the bucket full, than to let it drain.
Now suppose the bucket is made out of chain-link and any water you put into it immediately falls out. The level is simply the amount of water that happens to be passing through at that moment. And also the next time I need the bucket full is after one century. Well in that case, it would be wasteful to be dumping water through this bucket for a century.
bcoates
8 days ago
All heat that is lost must be replaced (we must input enough heat that the device returns to T_initial)
Hotter objects lose heat faster, so the longer we delay restoring temperature (for a fixed resume time) the less heat is lost that will need replacement.
Hotter objects require more energy to add another unit of heat, so the cooler we allow the device to get before re-heating (again, resume time is fixed) the more efficient our heating can be.
There is no countervailing effect to balance, preemptive heating of a device before the last possible moment is pure waste no matter the conditions (although the amount of waste will vary a lot, it will always be a positive number)
Even turning the heater off for a millisecond is a net gain.
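A toy lumped-capacitance sketch makes this concrete (all numbers assumed, Newtonian cooling): holding temperature costs k·C·ΔT·t, while cooling freely and reheating costs C·ΔT·(1 − e^(−kt)), which is strictly smaller for any t > 0 since 1 − e^(−kt) < kt.

```python
import math

# Assumed toy numbers: lumped hotend mass, Newton's law of cooling
C = 10.0           # heat capacity, J/K
k = 0.005          # cooling rate constant, 1/s
T_room, T_print = 25.0, 210.0
dT = T_print - T_room

def energy_hold(t):
    # Heater must continuously replace losses at constant T: P = k * C * dT
    return k * C * dT * t

def energy_cool_then_reheat(t):
    # Free cooling: T(t) = T_room + dT * exp(-k * t)
    # Reheating only restores the heat that actually leaked out
    return C * dT * (1 - math.exp(-k * t))

for pause in (60, 600, 3600):
    assert energy_cool_then_reheat(pause) < energy_hold(pause)
    print(pause, round(energy_hold(pause)), round(energy_cool_then_reheat(pause)))
```

Cool-then-reheat is bounded above by C·ΔT (you can never lose more heat than is stored), while holding grows without bound in pause time, so turning off always wins regardless of the constants chosen.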
gparke
7 days ago
Does it depend on whether you know in advance _when_ you need it back at the hot temperature?
If you don’t think ahead and simply switch the heater back on when you need it, then you need the heater on for _longer_.
That means you have to pay back the energy you lost, but also the energy you lose during the reheating process. Maybe that’s the countervailing effect?
> Hotter objects require more energy to add another unit of heat
Not sure about this. A unit of heat is a unit energy, right? Maybe you were thinking of entropy?
herdrick
8 days ago
No, you should always wait until the last possible moment to refill the leaky bucket, because the less water in the bucket, the slower it leaks, due to reduced pressure.
dTal
8 days ago
Allowing it to cool below the phase transition point of the melted plastic will cause it to release latent heat, so there is a theoretically possible corner case where maintaining it hot saves energy. I suspect that you are unlikely to hit this corner case, though I am too lazy to crunch the numbers in this comment.
yishanchuan
8 days ago
don't worry, it is really tricky for training
claiir
8 days ago
Same experience with my personal benchmarks. Generally unimpressed with Qwen3.
laurent_du
8 days ago
I do the same with a small math problem and so far only Qwen3 got it right (tested all thinking models). So your mileage may vary, as they say!
the13
a day ago
Show prompt
mromanuk
8 days ago
I was expecting a different outcome: that you'd tell us Qwen3 nailed it on the first try.
phonon
8 days ago
Qwen3-235B-A22B?
stavros
8 days ago
Yep, on Qwen chat.
nothink3194
8 days ago
[dead]
arthurcolle
8 days ago
Hi, I'm starting an evals company, would love to have you as an advisor!
999900000999
8 days ago
Not OP, but what exactly do I need to do.
I'll do it for cheap if you'll let me work remote from outside the states.
refulgentis
8 days ago
I believe they're kidding, playing on "my singular question isn't answered correctly"
arthurcolle
5 days ago
Absolutely, please reach out! arthur@distributed.systems
sieabahlpark
8 days ago
[dead]