The Python Package Index Should Get Rid of Its Training Wheels

79 points, posted a day ago
by AndyKelley

70 Comments

BiteCode_dev

5 hours ago

One of the reasons Python is so popular as a scripting language in science and ML is that it has a very good story for installing Frankenstein code bases made of assembly, C and Pascal sprinkled with SIMD.

I was here before Anaconda popularized the idea of binary packages for Python and inspired wheels to replace eggs, and I don't want to go back to having to compile that nightmare on my machine.

People who have that kind of idea are likely capable of running k8s containers, understand vectorization, and can code a monad off the top of their head.

Half of Python coders struggle to use their terminal. You have Windows devs who live in Visual Studio, high school teachers who barely show a few functions, mathematicians replacing R/Matlab, biologists forced to script something to write a paper, frontend devs who just learned JS is not the only language, geographers begging their GIS system to do something it's not made for, kids messing with their dad's laptop, and probably a dog somewhere.

Binary wheels are a gift from the Gods.

HelloNurse

5 hours ago

Compiling Python extensions is a nightmare because we allow Autoconf, CMake, Visual Studio, Bazel etc. to make it complicated and nonportable; when someone sets out to wrap some library for Python the quality of the result is limited by the quality of the tools and by low expectations.

A serious engineering effort along the lines of the Zig compiler would allow Python to build almost everything from source out of the box; exotic compilers and binary dependencies, not "Frankenstein code bases" per se, are the actual obstacles.

ddulaney

4 hours ago

What you’re proposing here is essentially “if we fix the C/C++ build systems environment, this would be easy!”. You’re absolutely right, but fixing that mess has been a multi-decade goal that’s gone nowhere.

One of the great victories of new systems languages like Rust and Zig is that they standardized build systems. But untangling each individual dependency’s pile of terrible CMake (or autoconf, or vcxproj) hacks is a project in itself, and it’s often a deeply political one tied up with the unique history of each project.

kristoff_it

4 hours ago

> What you’re proposing here is essentially “if we fix the C/C++ build systems environment, this would be easy!”. You’re absolutely right, but fixing that mess has been a multi-decade goal that’s gone nowhere.

Not sure I would call it easy, as it would still take a lot of effort to update how PyPI works to account for these new capabilities, but that's exactly what the Zig compiler & build system solved.

Rust is completely hands-off when it comes to C/C++ dependencies; Zig can package and build them. That's why I created https://github.com/allyourcodebase/.

As I've mentioned on lobsters, look at this example `build.zig.zon` file: https://github.com/allyourcodebase/srt/blob/main/build.zig.z...

It mentions 3 dependencies:

    Haivision/srt, the upstream C++ project
    mbedtls
    googletest
When you run zig build, all three dependencies are downloaded and their build.zig is run if present (the first one doesn't have a build.zig, since it's just the vanilla C/C++ upstream project that we're providing a build script for).

The work to package everything must still happen, but once it’s done correctly you get the ability to build from any host for any target, which you can literally see happening in the CI runs of that repo: https://github.com/allyourcodebase/srt/actions/runs/10982569...
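For example, cross-compiling is a single invocation (the target triples below are just illustrative, and assume the package's build.zig exposes the standard target option):

    zig build                               # native build on the current host
    zig build -Dtarget=aarch64-linux-gnu    # cross-compile for ARM Linux
    zig build -Dtarget=x86_64-windows-gnu   # cross-compile for Windows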

This kind of skepticism is exactly why I wrote this post. The details of actually coming up with a realistic upgrade path for PyPI are certainly much more nuanced than what I wrote in the post, but the core insight is that the C/C++ build intractability problem has been solved... and that you shouldn't depend on free big tech money if you can avoid it.

HelloNurse

2 hours ago

The "upgrade path" is for individual packages (providing portable build scripts) and for client-side Python package management (actually providing and running next-generation tools), not for PyPI which already supports package metadata stating what platforms a package is compatible for.

sitkack

3 hours ago

They do, Zig ships clang. The easiest way to install clang is via pip.

    pip install ziglang

    python -m ziglang clang --version
    clang version 18.1.6 (https://github.com/ziglang/zig-bootstrap 98bc6bf4fc4009888d33941daf6b600d20a42a56)
    Target: aarch64-unknown-darwin23.6.0
    Thread model: posix
    InstalledDir: /usr/bin
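And since setuptools respects the CC/CXX environment variables, you can in principle point a source build at it (the package name below is just a placeholder, and any given package may still need extra flags):

    CC="python -m ziglang cc" \
    CXX="python -m ziglang c++" \
    pip install --no-binary :all: somepackage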
Very soon, Zig will all your (code)base.

HelloNurse

4 hours ago

The typical build scripts and tools and terrible hacks that are adequate for a standalone C/C++/Fortran project are lacking for building the same project as a Python extension, which requires portability and supported, not too custom, build steps.

The ambition and usefulness of aiming for the latter, higher standard are quite new: it has gone (relatively) nowhere because it has been a multi-decade non-goal, which only a small minority of users cares about.

theamk

2 hours ago

Hard-to-build python extensions are basically .so files with some special symbols exposed.

As others said, a general task to "build libfoo.so" is very complex, and currently requires a wide variety of build systems. I don't see why this will get any easier if we require this .so file to export some Python-specific symbols.

klyrs

an hour ago

You and your respondents see this as a Python problem. I see it as a Zig problem.

As in, Zig will seamlessly add Python to its build system long before Python's build story becomes that robust.

BiteCode_dev

2 hours ago

Problem: saving bandwidth for pypi.

Solution: unify the entire software build stack.

Tomorrow, chat, we will tackle world hunger in an attempt to save people from anorexia.

Like and subscribe.

aragilar

4 hours ago

Autoconf is easy; the challenge is the highly bespoke build systems that someone thought would be a good idea and that require the right phase of the moon.

aragilar

4 hours ago

Wheels do predate conda (though manylinux did base itself on the experience of Anaconda and Enthought's base set of libraries), and there were distributions like Enthought (or even more field specific distributions like Ureka and individuals like Christoph Gohlke) that provided binaries for the common packages.

What the conda ecosystem did was provide a not-horrible package manager that included the full stack (in a pinch, to help fix up a student's git repository on a locked-down Windows system, I used conda to get git; you can also get R and do R<->Python connections easily). And by providing a standard repository interface (as opposed to the locked-down and limited versions the other providers appeared to offer), conda pushed out anyone doing something bespoke and centralised efforts, so spins like conda-forge, bioconda and astroconda could focus on their niche and do it well.

tonnydourado

4 hours ago

> was here before Anaconda popularized the idea of binary packages for Python

This has incredible "I was there, Gandalf, three thousand years ago" vibes =P

choeger

6 hours ago

The analysis contains an error. Binary artifacts don't cause exponential growth in storage requirements. It's still just linear. That's also quite clearly seen in the fact that, even after a phase of exponential growth, the binary artifacts still only account for 75%.

So this whole strategy (actually a pitch for zigbuild) would ideally reduce the storage requirements to 25% - which would buy the whole system maybe a year or two if the growth continues.

Of course, it's a good idea to build client-side. Especially considering the security implications. But it won't fundamentally change the problem.

kristoff_it

3 hours ago

The problem is not space, it's the fact that bandwidth costs 4x the total operating income that the PSF has. Everything else is context to understand and offer some insight into this problem.

This is a pitch for Zig as a C/C++ build system because, as you can see in other comments in this submission, there's still a lot of people that don't even believe this is a solvable problem at all, while in reality Zig does a pretty good job at solving it.

Only somebody actively involved with Python can really say if and how this could help improve the situation for PyPI, but if the people who actually happen to be working on something potentially beneficial weren't allowed to speak "because it's a pitch" then what even is the point of software interoperability.

As an occasional Python user, I also think that it's silly that people keep putting containers everywhere and pre-building a ton of binaries when I know for a fact that you could trivially `zig build` that shit, since I do it in my projects all the time (see https://github.com/allyourcodebase/).

chippiewill

4 hours ago

> Binary artifacts don't cause exponential growth in storage requirements. It's still just linear.

I think what the author was getting at was the combinatoric explosion as you add different variants (CPU architecture, operating systems, python versions) - but I agree that "exponential growth" is not the right term to use here.

yaleman

11 hours ago

The fact that tensorflow takes up 12.9TiB is truly horrifying, and most of it is because they use pypi's storage as a dumping ground for their pre-release packages. What a nightmare they've put on other people's shoulders.

theamk

2 hours ago

I think pypi should require larger packages, like tensorflow, to self-host their releases.

The support for that already exists: the PyPI index file contains an arbitrary URL for the data file plus a sha256 hash. Let PyPI store the hashes, so there are no shenanigans with versions being secretly overridden, but point the actual data URLs at other servers.
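Roughly, a simple-index entry for an externally hosted file would look something like this (hypothetical project and host; pip verifies the sha256 fragment after downloading):

    <a href="https://downloads.example.org/bigpkg-2.0-cp312-cp312-manylinux_2_28_x86_64.whl#sha256=0123abcd...">bigpkg-2.0-cp312-cp312-manylinux_2_28_x86_64.whl</a>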

(There must obviously be a balance for availability vs pypi's cost, so maybe pypi hosts only smaller files, and larger files must be self-hosted? Or pypi hosts "major releases" while pre-releases are self-hosted? And there should be manual exceptions for "projects with funding from huge corporations" and "super popular projects from solo developers"...)

aragilar

4 hours ago

I believe tensorflow does remove old pre-releases (I know other projects do), so that number I think might be fairly static?

That tensorflow is that big isn't surprising, given the install of it plus its dependencies is many gigabytes (you can see the compressed sizes of wheels on the release pages e.g. https://pypi.org/project/tensorflow/#files), and the "tensorflow" package (as opposed to the affiliated packages) based on https://py-code.org/stats is 965.7 GiB, which really only includes a relatively small number of pre-releases.

Why tensorflow is that big comes down to needing to support many different kinds of GPUs with different ecosystem versions, and I suspect the build time of them with zig cc (assuming it works, and doesn't instead require pulling in a different compiler/toolchain) would be so excessive (especially on IoT/weaker devices) that it would make the point of the exercise moot.

amoshebb

4 hours ago

Is it though? If it saves one engineer one afternoon that storage has paid for itself, and this thing has hundreds of thousands of downloads a day.

Wouldn’t it be more horrifying to force everybody who wants to use a prerelease to waste an afternoon getting it to build just to save half a hard drive?

skeledrew

2 hours ago

That's beside the point though. Yes, having prebuilt binaries is very helpful. But what happens if Fastly decides against renewing next time and there is nobody else willing to sponsor? The cost is sky-high, far beyond what the PSF can handle. Where does PyPI go?

pxc

an hour ago

There are already lots of passable package managers that know how to provide working binaries for the native, non-Python dependencies of Python packages. Instead of trying to make Python packages' build processes learn how to build everything else in the world, one thing Python packages could do is just record their external dependencies in a useful way. Then package managers that are actually already designed for and committed to working with multiple programming language ecosystems could handle the rest.

This is something that could be used by Nix, Guix, and Spack, as well as by more conventional software distributions like Conda, Pkgsrc, MacPorts, Homebrew, etc. With the former, users could even set up per-project environments that contain those external dependencies, like virtualenvs but much more general. The simple feature of this metadata would naturally be valuable, if provided well, to maintainers of all Linux distros and many other software distributions, where autogenerated packages are already the norm for languages like Rust and Go, while creating such tooling for Python is riddled with thorny problems. So these two proposals are not mutually exclusive, and perhaps each is individually warranted on its own.

Enriching package metadata in this simple way has already been proposed here:

https://peps.python.org/pep-0725/
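A sketch of the kind of pyproject.toml metadata it proposes (table and identifier names taken from the draft PEP, so they may still change):

    [external]
    build-requires = [
      "virtual:compiler/c",
    ]
    host-requires = [
      "pkg:generic/openssl",
      "pkg:generic/libffi",
    ]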

trlampert

4 hours ago

"When Python came into existence, repeatable builds (i.e. not yet reproducible, but at least correctly functioning on more than one machine) were a pipe dream. Building C/C++ projects reliably has been an intractable problem for a long time, but that's not true anymore."

I'd dispute that. It used to be the case that building NumPy just worked; now there are Cython/Meson and a whole lot of other dependency issues, and the build fails.

"At the Zig Software Foundation we look up to the Python Software Foundation as a great example of a fellow 501(c)(3) non-profit organization that was able to grow an incredibly vibrant community ..."

Better don't meet your heroes. Python was a reasonable community made up and driven by creative individuals. In the last 8 years, it has been taken over by corporate bureaucrats who take credit for the remnants of that community and who will destroy it. The PSF has never done anything except for selling expensive conference tickets and taking care of its own.

rightbyte

9 hours ago

These package repositories are used in a wasteful way. Probably by thousands of CI servers spinning up blank-slate Docker containers etc.

strokirk

6 hours ago

CI providers should definitely start proxying PyPI with their own cache.

Numerlor

6 hours ago

I wanted to spin up a mirror locally to do simple caching for Docker builds, but the tooling was lacking: there was a way to do a direct mirror of PyPI locally, but no other way of adding custom indices.

jeroenhd

5 hours ago

I think Sonatype Nexus [1] can do that relatively easily. I don't know if the OSS version is enough, but I think most people and projects should be fine.

[1]: https://www.sonatype.com/products/sonatype-nexus-oss-downloa...

eKIK

4 hours ago

We've used Nexus OSS just the way you describe and it worked great.

We simply set it up as a kind of "passthrough cache", so if it didn't have the package it fetched it from pypi, and stored it to be used the next time someone wanted to install the same package.

Apart from being nice to pypi, we also got a bit of a decrease in CI runtime, because it fetched packages from the local cache 99% of the time.

TheChaplain

5 hours ago

DevPi might be your answer, I think. A couple of years ago I set it up as a proxy, plus hosting my own packages locally.
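Roughly (the exact commands may differ between versions):

    pip install devpi-server devpi-client
    devpi-init     # one-time state setup (older releases used: devpi-server --init)
    devpi-server   # serves on http://localhost:3141 by default
    pip install --index-url http://localhost:3141/root/pypi/+simple/ requests

The root/pypi index is a caching proxy of PyPI, so anything not already present gets fetched once and then served from the local cache.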

Numerlor

5 hours ago

I'll take a look. I think it's something I looked at and had some issues with, but it has been a couple of years and the only thing I can remember is bandersnatch.

robertlagrant

2 hours ago

Yeah we use Artifactory internally as a proxy for DockerHub, Pypi, NPM, etc.

zamlag

5 hours ago

At work we don't use PyPI any longer. We have our own set of curated packages; the security issues are just too great:

https://developers.slashdot.org/story/24/09/15/0030229/fake-...

https://jfrog.com/blog/revival-hijack-pypi-hijack-technique-...

https://jfrog.com/blog/leaked-pypi-secret-token-revealed-in-...

We consider switching to Java, C++ or Rust because of general quality issues with Python.

notpushkin

5 hours ago

What do Java and Rust (and your C++ package manager of choice) do to mitigate those things?

theamk

an hour ago

Don't know about Java or Rust, but in C++ it's much harder to get new packages, which works wonders to keep the number of dependencies down.

In Python, installing packages is so simple that people just do it after a 10-second Google search - and that library can pull in dozens or hundreds of dependencies with usually no review.

In C++, given that you have to manually find the library and incorporate it into the build system, people generally spend a few minutes looking at the options and choosing the best one - and this includes checking things like "how long has this library been around", "does it have many users" and "is it in a healthy state". And this must be repeated for each dependency as well, so even a dozen dependencies is a huge negative - and such a library will not be used unless there are no better alternatives.

(The exception to this rule are libraries provided by your Linux distribution. Those can be easily installed by the dozen, and that's OK - the distribution makers did all the hard work for you, vetting and packaging those libraries)

This in general means a much healthier dependency state for C++, as well as much higher code quality. No one is going to add a dependency to a core library just to get better progress bars, for example.

ajrqh

5 hours ago

What is not clear about "general quality issues", i.e., issues unrelated to package management?

notpushkin

4 hours ago

OK, my bad. It is even more unclear though – do you have any particular examples of such issues?

faustin

13 hours ago

conda-forge handles the first part of this (reproducible builds) for most common platforms. The idea of rebuilding deleted artifacts on demand sounds nice in theory, but it has the complication that rebuilding something that depends on several other somethings will likely trigger a build cascade where a bunch of stuff has to get built in order. Hopefully none of those ancient build scripts require external resources hosted at dead links!

aragilar

5 hours ago

Also, this is very much assuming that the code is C or C++ and that LLVM is the right compiler to use. Fortran is still a major part of the ecosystem, which the Zig compiler isn't going to solve. There already exist numerous options to provide compilers for the problematic platforms; the fact is binary wheels (mostly) solve the issue far better than doing local builds.

Also, the large packages are typically due to the need to support the huge number of possible GPU combinations (because you care about what CUDA versions are supported).

This feels like a solution being forced on a problem (not that zig cc isn't cool), but the post has really misunderstood the issues around wheels.

AndyKelley

12 hours ago

One strategy would be making PyPI packages fetch any external resources from PyPI, or at least add PyPI URLs as mirrors for such resources.

Wowfunhappy

5 hours ago

I'm a bit confused as to why this costs so much. I thought storage was cheap?

Bandwidth is more expensive, but shouldn't be relevant to this problem. It doesn't matter whether 5 people request the same binary or 5 people request 5 different binaries for different platforms, if all the binaries are 1 gb you're transferring 5 gb of data either way.

kristoff_it

3 hours ago

Once you have control over generating the binaries, you can create a system that is more akin to what Nix offers, where other parties can set up secondary caches for their projects, and if they don't, the main registry has the option to decide what to do about it without breaking any package.

Currently PyPI is railroaded into a fairly tight set of choices because it can't exercise a lot of control over binary data.

The blog post mentions this point directly.

Wowfunhappy

3 hours ago

Thanks. The blog post alludes to mirrors but I didn't realize that was the primary goal of this project.

I am somewhat skeptical that these projects will have an incentive to actually set up mirrors. For example, I imagine Tensorflow likes getting free hosting.

kristoff_it

3 hours ago

I haven't checked, but usually when it's this visible, big tech companies tend to reciprocate so I would be surprised if Google didn't sponsor the PSF in any way.

That said, once deleting prebuilt binaries doesn't break the package anymore, if Google were to not play fair, PyPI could simply delete them. Users would experience temporary discomfort (but nothing would be irremediably broken), and soon after I'm sure that Google would decide to set up a cache for TF.

PyPI would almost certainly not even need to actually do it, the implication that it could be done would probably be enough to align everybody towards what's best for the ecosystem.

bravetraveler

4 hours ago

Really depends on how you get the storage/interface with it. Cloud/VPS is some of the worst 'bang for the buck' in my experience, where S3 or dedicated can be more favorable at a point. Not all gigglebytes cost the same :)

The article may shed some light on this, I'm not aware. Haven't read yet! I've managed a few RPM package mirrors. Dedicated gear that's a few generations old with no-name providers was the best experience.

Havoc

5 hours ago

12TB for tensorflow is absurd.

rwmj

5 hours ago

Just use the system packages! Fedora, Debian, AUR, brew/macports on macOS, etc are all a thing, use them.

BiteCode_dev

5 hours ago

There are currently 570K+ projects in pypi, 60k+ in debian repos.

It can take several months of work to approve one single package to the official repos, for a single distribution. And each have different rules and setup.

Now explain to me how you think this is going to work.

Also, do you plan to force everyone to use chroot or containers to replace their virtualenv systems to have variations on deps? Or maybe everybody should use Nix?

Do you want to do that also for JS, Ruby and PHP?

rwmj

5 hours ago

> There are currently 570K+ projects in pypi, 60k+ in debian repos.

Not all PyPi projects require C code.

> It can take several months of work to approve one single package to the official repos

This is a massive exaggeration. For Fedora it takes a couple of days, all of it being necessary review of the code and licenses. And yes, you do have to do that work, it's done by the distros themselves too.

kbolino

5 hours ago

AUR is not a source of system packages. It is a community-maintained source of package build scripts which each user must download and build themselves.

cozzyd

4 hours ago

I wish there was an easy way for pip to list what it's missing, so it's easier to install dependencies via the system package manager. Also it's sad how setuptools' bdist_rpm seems to be considered deprecated.

tempfile

4 hours ago

This is how it should be, but it isn't how it is. The system packages are generally very outdated, and don't support multiple versions on a single system (which makes dependency locking a fantasy).

I dream of the day where we only have one package manager on a single machine.

echelon

5 hours ago

Python builds weren't really hermetic before, but now your entire operating system is a part of the software definition.

This is a slide backwards.

aragilar

4 hours ago

The OS always was part of the software definition, and always will be (unless you run without an OS/write your own).

echelon

3 hours ago

How much of the host system, though? You want it to be as thin as possible. And ideally it wouldn't be a part of the definition at all.

cozzyd

4 hours ago

Plenty of python packages depend on various system packages outside of PyPI anyway.

kristoff_it

3 hours ago

Yeah, and that's another source of jankiness that could be avoided.

ForHackernews

5 hours ago

Doesn't brew compile from source every time on the user's machine?

notpushkin

5 hours ago

They do have binary packages now I think.

atemerev

4 hours ago

Nope. System packages and language dependencies are better not mixed, and the horrors if you do mix them are very real.

JohnMakin

4 hours ago

I don’t pretend to know the answer here, but python is unavoidable in my work and the packaging is a constant source of irritation for me. Disclaimer: I am not pretending to be a python expert here but coming from a C background this anecdote is baffling to me.

I am writing some Lambda OAuth glue logic using Python because Python is the best choice for this particular implementation. I need to package the "jwt" library, which worked absolutely fine for a while. Then I upgraded Python versions, and the particular pre-built container I was using to run pip and zip the packages up ended up with totally borked versions of jwt which seemed to not have functions I was previously using fine.

Dig in, finally figure out that my older version was actually importing PyJWT even though I had specified “jwt.” The new container was breaking because it was actually installing a different “jwt” library. So, the solution was to specifically specify PyJWT and which version I wanted in my pip install. Great! That’s how I think it should work and I was a little baffled that pip had made that decision for me previously.

Anyway, it now has my missing functions but is still crashing in my blue deployments. Wtf? Oh, this PyJWT import is missing an algorithm. To fix that, I also need to pip install "cryptography" (making sure to get the compatible version matrix here; at this point I had stopped trusting pip).
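A sketch of what the pinned install might look like (versions illustrative; PyJWT's crypto extra pulls in a compatible cryptography for the RSA/EC algorithms):

    pip install "PyJWT[crypto]>=2.8,<3"
    # the import name in code is still jwt:
    # import jwt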

So maybe this is all very obvious and "duh" to veterans out there, but this was impossibly silly and wasted a stupid amount of time for something that should be dead simple. Yeah, wrangling makefiles and fussing with linkers can be annoying, but I'd take that any day over this bs.

eesmith

5 hours ago

Some issues I see are:

- packages which only distribute binaries (eg, closed-source or source-for-a-fee distributions)

- it looks like Zig's C compiler does not support OpenMP, which I use

- what is the cut-off time for source vs. binary distribution? My package takes about a minute to compile (it has a lot of auto-generated source).

- what's the user impact if they have 10 projects which are just under that threshold?

- compile-time dependencies which are not recorded in pyproject.toml (like, having a Fortran compiler, having yacc/bison, etc.)

atemerev

4 hours ago

Nope. Wheels are the only thing that makes Python/PyPI usable. I don’t want to wait tens of minutes to recompile pytorch, or something (and conda is way too heavyweight for my tastes)

zzzeek

5 hours ago

My own personal TL;DR would be: PyPI has to store too much data in the form of pre-built binaries that are uploaded by package authors, so Python should adopt a repeatable build format so that PyPI itself can build wheels for any platform on demand (edit: am I misunderstanding? did they mean wheels can be built as part of the local install process?). The author is involved with some special compiler to do this.

Personally I'd love it if PyPI could build our wheels for us. That would be great; we use GitHub Actions right now, which has its own complexities, and for years we had nothing. But that would mean a huge ramp-up in processing capability for PyPI. Considering PyPI can't even handle having its packages signed and "solved" the problem by sending authors obnoxious emails if we even dare to push up a signature file, I'm not too optimistic about such a change vs. their just continuing to rely on corporate sponsors to deliver bandwidth.

robertlagrant

2 hours ago

My read was it enables both: PyPI can delete all packages that haven't been downloaded in a year or more, safe in the knowledge that they can be quickly recreated and cached, but also that more packages can be downloaded as source, and compiled at target.

Siecje

5 hours ago

Compatible releases should replace older versions.

Why would you want to install an old version of ruff?