npalli
7 days ago
Kudos. I think (in the short term at least) there's a large amount of perf optimization to be found by coding parts of the AI/ML infrastructure in C++ like this one, not as a rewrite (god no!) but by dropping in and fixing key bottlenecks. Any time I see someone (Chinese engineers seem to be good at this) put something out in C++, there's a good chance some solid engineering tradeoffs have been made and a dramatic improvement will be seen.
matthewolfe
7 days ago
Agreed. A former mentor of mine told me a nice way of viewing software development:
1. Make it work. 2. Make it fast. 3. Make it pretty.
Transformers & LLMs have been developed to a point where they work quite well. I feel as though we're at a stage where most substantial progress is being made on the performance side.
diggan
7 days ago
Heh, seems the people I've been learning from have been biased away from beauty, as I know it as "Make It Work, Make It Right, Make It Fast".
kevindamm
7 days ago
I've usually heard/said it as
1. Make it
2. Make it work
3. Make it work better
(different circumstances have different nuances about what "better" means; it isn't always performance optimization. Some do substitute "faster" for "better" here, but I think it loses generality then.)
acosmism
7 days ago
i like this version best
gabrielhidasy
7 days ago
I always heard the "Make it Right" as "Make it Beautiful", where Right and Beautiful would mean "non-hacky, easily maintainable, easily extendable, well tested, and well documented"
mindcrime
7 days ago
I've always heard it (and said it) as:
1. Make it work
2. Make it correct
3. Make it fast
abybaddi009
7 days ago
What's the difference between make it work and make it right? Aren't they the same thing?
gopalv
7 days ago
> make it work and make it right?
My mentor used to say it's the difference between a screw and glue.
You can glue some things together and prove that it works, but eventually you learn that any time you had to break something to fix it, you should've used a screw.
It's a trade-off in coupling - the glue binds tightly over the entire surface, but a screw concentrates the loads, so it needs maintenance to stay tight.
You only really know which is "right" if you test it to destruction.
All of that advice probably sounds dated now; even in materials science the glue might be winning (see the Tesla bumper or Lotus Elise bonding videos - every screw is extra grams).
robertfw
7 days ago
Making it work can be a hacky, tech debt laden implementation. Making it right involves refactoring/rewriting with an eye towards maintainability, testability, etc etc
stavros
7 days ago
Yeah, if it's not right, it doesn't work.
darknoon
7 days ago
In ML, often it does work to a degree even if it's not 100% correct. So getting it working at all is all about hacking b/c most ideas are bad and don't work. Then you'll find wins by incrementally correcting issues with the math / data / floating point precision / etc.
gabrielhidasy
7 days ago
Depends on your definition of "right" and "work". It could be a big ball of mud that always returns exactly the required response (so it 'works'), but be hellishly hard to change and very picky about dependencies and environment (so it's not 'right').
stavros
7 days ago
Nope, it's right, but it's not pretty.
DSingularity
7 days ago
Not true. Things can work with hacks. Your standards might consider it unacceptable to have hacks. So you can have a “make it right” stage.
matthewolfe
7 days ago
Fair chance I'm remembering it wrong :D
binarymax
7 days ago
The Huggingface transformers lib is currently undergoing a refactor to get rid of cruft and make it more extensible, hopefully with some perf gains.
jotux
7 days ago
A similar concept dates back to 30BC: https://en.wikipedia.org/wiki/De_architectura
Firmitas, utilitas, venustas - Strong, useful, and beautiful.
saretup
7 days ago
And while we’re at it, let’s move away from Python altogether. In the long run it doesn’t make sense just because it’s the language ML engineers are familiar with.
tbalsam
7 days ago
No! This is not good.
Iteration speed trumps all in research. Most of what Python does is launch GPU operations; if you're seeing slowdowns from Python-land, you're doing something terribly wrong.
Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.
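For example, native code really can be called as just another Python function. A minimal sketch using ctypes, with the C standard math library standing in for a custom C++ kernel (assumes a Unix-like system where `find_library("m")` resolves):

```python
import ctypes
import ctypes.util

# Load a native shared library; in practice this would be your
# compiled C++ extension rather than libm.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes marshals doubles correctly.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

# Python only orchestrates; the work happens in native code.
print(libm.sqrt(2.0))
```

In real projects the same pattern usually goes through pybind11 or a Rust/PyO3 module, but the division of labor is identical: Python dispatches, native code does the heavy lifting.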
bigyabai
7 days ago
It makes plenty of sense. Python handles strings well, has a great package ecosystem, and is easy to write/learn for non-programmers. It can be easily embedded into a notebook (which is huge for academics) and is technically a "write once run anywhere" platform in theory. It's great.
If you think Python is a bad language for AI integrations, try writing one in a compiled language.
mdaniel
7 days ago
> has a great package ecosystem
So great there are 8 of them. 800% better than all the rest!
> If you think Python is a bad language for AI integrations, try writing one in a compiled language.
I'll take this challenge, all day, every day, so long as I and the hypothetical 'move fast and break things' dev are held to the same "must run in prod" and "must be understandable by some other human" qualifiers
What type is `array`? Don't worry your pretty head about it, feed it whatever type you want and let Sentry's TypeError sort it out <https://github.com/openai/whisper/blob/v20250625/whisper/aud...> Oh, sorry, and you wanted to know what `pad_or_trim` returns? Well that's just, like, your opinion man
bigyabai
7 days ago
Tracks with me, I don't like using Python for real programming. Try explaining any of your "Python sucks" catechisms to a second-year statistics student though. If you'd rather teach them C++, be my guest. If you want to make them indebted to proprietary infra like Mojo or CUDA, knock yourself out.
I'm still teaching them Python.
janalsncm
7 days ago
Most of that is already happening under the hood. A lot of performance-sensitive code is already written in C or Cython - for example NumPy, scikit-learn, and pandas. Lots of torch code is either C++ or CUDA.
ML researchers aren’t using python because they are dumb. They use it because what takes 8 lines in Java can be done with 2 or 3 (including import json) in python for example.
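For instance, parsing JSON really is that short in Python (the literal here is purely illustrative):

```python
import json

# Parse a JSON string into a dict and read a field back out.
data = json.loads('{"model": "gpt", "layers": 12}')
print(data["layers"])  # → 12
```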
ipsum2
7 days ago
Sort of. The key bottlenecks are not in tokenization, but in running the actual CUDA kernels. Python actually has very little overhead. (See vLLM, which is primarily in Python.) So when people (like DeepSeek) 'rewrite in C++', they're usually just rewriting CUDA kernels to be more efficient.
notatallshaw
7 days ago
It looks like tiktoken is written in Rust (https://github.com/openai/tiktoken/tree/main/src), so are the gains here actually from porting to C++?
fhub
7 days ago
From the post:
> Profiling tiktoken's Python/Rust implementation showed a lot of time was spent doing regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine; and (b) simplifying the algorithm to forgo regex matching of special tokens entirely.
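For illustration, a hedged sketch (not the post's actual code) of what dropping regex for special tokens can look like: plain substring search locates each special token, and only the text between them would go on to the ordinary tokenization path.

```python
def split_special(text, specials):
    """Split text around special tokens using str.find, no regex."""
    parts = []
    i = 0
    while i < len(text):
        # Find the earliest occurrence of any special token from position i.
        hit = min(
            ((text.find(s, i), s) for s in specials if text.find(s, i) != -1),
            default=None,
            key=lambda t: t[0],
        )
        if hit is None:
            parts.append(text[i:])  # no more specials; keep the tail
            break
        pos, s = hit
        if pos > i:
            parts.append(text[i:pos])  # ordinary text before the special token
        parts.append(s)  # the special token itself, kept as one piece
        i = pos + len(s)
    return parts

print(split_special("hello<|eot|>world", ["<|eot|>"]))
# → ['hello', '<|eot|>', 'world']
```

The ordinary-text segments would then be fed to the BPE/regex tokenizer, while the special tokens map directly to their IDs.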