The Python Package Index Should Get Rid of Its Training Wheels

79 points, posted a day ago
by AndyKelley

70 Comments

BiteCode_dev

5 hours ago

One of the reasons Python is so popular as a scripting language in science and ML is that it has a very good story for installing Frankenstein code bases made of assembly, C and Pascal sprinkled with SIMD.

I was here before Anaconda popularized the idea of binary packages for Python and inspired wheels to replace eggs, and I don't want to go back to having to compile that nightmare on my machine.

People who have that kind of idea are likely capable of running k8s containers, understand vectorization, and can code a monad off the top of their head.

Half of Python coders struggle to use their terminal. You have Windows devs who live in Visual Studio, high school teachers who barely show a few functions, mathematicians replacing R/Matlab, biologists forced to script something to write a paper, frontend devs who just learned JS is not the only language, geographers begging their GIS system to do something it's not made for, kids messing with their dad's laptop, and probably a dog somewhere.

Binary wheels are a gift from the Gods.

HelloNurse

5 hours ago

Compiling Python extensions is a nightmare because we allow Autoconf, CMake, Visual Studio, Bazel etc. to make it complicated and nonportable; when someone sets out to wrap some library for Python the quality of the result is limited by the quality of the tools and by low expectations.

A serious engineering effort along the lines of the Zig compiler would allow Python to build almost everything from source out of the box; exotic compilers and binary dependencies, not "Frankenstein code bases" per se, are the actual obstacles.

ddulaney

4 hours ago

What you’re proposing here is essentially “if we fix the C/C++ build systems environment, this would be easy!”. You’re absolutely right, but fixing that mess has been a multi-decade goal that’s gone nowhere.

One of the great victories of new systems languages like Rust and Zig is that they standardized build systems. But untangling each individual dependency’s pile of terrible CMake (or autoconf, or vcxproj) hacks is a project in itself, and it’s often a deeply political one tied up with the unique history of each project.

kristoff_it

4 hours ago

> What you’re proposing here is essentially “if we fix the C/C++ build systems environment, this would be easy!”. You’re absolutely right, but fixing that mess has been a multi-decade goal that’s gone nowhere.

Not sure I would call it easy, as it would still take a lot of effort to update how PyPI works to account for these new capabilities, but that's exactly what the Zig compiler & build system solved.

Rust is completely hands-off when it comes to C/C++ dependencies; Zig can package and build them. That's why I created https://github.com/allyourcodebase/.

As I've mentioned on lobsters, look at this example `build.zig.zon` file: https://github.com/allyourcodebase/srt/blob/main/build.zig.z...

It mentions 3 dependencies:

    Haivision/srt, the upstream C++ project
    mbedtls
    googletest
When you run zig build, all three dependencies are downloaded and their build.zig is run if present (the first one doesn't have a build.zig, since it's just the vanilla C/C++ upstream project that we're providing a build script for).

The work to package everything must still happen, but once it’s done correctly you get the ability to build from any host for any target, which you can literally see happening in the CI runs of that repo: https://github.com/allyourcodebase/srt/actions/runs/10982569...
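For example, cross-compiling is a single invocation (the target triples below are just illustrative, and assume the package's build.zig exposes the standard target option):

    zig build                               # native build on the current host
    zig build -Dtarget=aarch64-linux-gnu    # cross-compile for ARM Linux
    zig build -Dtarget=x86_64-windows-gnu   # cross-compile for Windows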

This kind of skepticism is exactly why I wrote this post. The details of actually coming up with a realistic upgrade path for PyPI are certainly much more nuanced than what I wrote in the post, but the core insight is that the C/C++ build intractability problem has been solved... and that you shouldn't depend on free big tech money if you can avoid it.

HelloNurse

2 hours ago

The "upgrade path" is for individual packages (providing portable build scripts) and for client-side Python package management (actually providing and running next-generation tools), not for PyPI which already supports package metadata stating what platforms a package is compatible for.

sitkack

3 hours ago

They do, Zig ships clang. The easiest way to install clang is via pip.

    pip install ziglang

    python -m ziglang clang --version
    clang version 18.1.6 (https://github.com/ziglang/zig-bootstrap 98bc6bf4fc4009888d33941daf6b600d20a42a56)
    Target: aarch64-unknown-darwin23.6.0
    Thread model: posix
    InstalledDir: /usr/bin
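And since setuptools respects the CC/CXX environment variables, you can in principle point a source build at it (the package name below is just a placeholder, and any given package may still need extra flags):

    CC="python -m ziglang cc" \
    CXX="python -m ziglang c++" \
    pip install --no-binary :all: somepackage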
Very soon, Zig will all your (code)base.

HelloNurse

4 hours ago

The typical build scripts and tools and terrible hacks that are adequate for a standalone C/C++/Fortran project are lacking for building the same project as a Python extension, which requires portability and supported, not too custom, build steps.

The ambition and usefulness of aiming for the latter, higher standard are quite new: it has gone (relatively) nowhere because it has been a multi-decade non-goal, which only a small minority of users cares about.

theamk

2 hours ago

Hard-to-build python extensions are basically .so files with some special symbols exposed.

As others said, a general task to "build libfoo.so" is very complex, and currently requires a wide variety of build systems. I don't see why this will get any easier if we require this .so file to export some Python-specific symbols.

klyrs

an hour ago

You and your respondents see this as a Python problem. I see it as a Zig problem.

As in, Zig will seamlessly add Python to its build system long before Python's build story becomes that robust.

BiteCode_dev

2 hours ago

Problem: saving bandwidth for pypi.

Solution: unify the entire software build stack.

Tomorrow, chat, we will tackle world hunger in an attempt to save people from anorexia.

Like and subscribe.

aragilar

4 hours ago

Autoconf is easy; the challenge is the highly bespoke build systems that someone thought would be a good idea and that require the right phase of the moon.

aragilar

4 hours ago

Wheels do predate conda (though manylinux did base itself on the experience of Anaconda and Enthought's base set of libraries), and there were distributions like Enthought (or even more field specific distributions like Ureka and individuals like Christoph Gohlke) that provided binaries for the common packages.

What the conda ecosystem did was provide a not-horrible package manager that included the full stack (in a pinch, to help fix up a student's git repository on a locked-down Windows system, I used conda to get git; you can also get R and do R<->Python connections easily). And by providing a standard repository interface (as opposed to the locked-down and limited versions the other providers appeared to offer), conda pushed out anyone doing something bespoke and centralised efforts, so spins like conda-forge, bioconda and astroconda could focus on their niche and do it well.

tonnydourado

4 hours ago

> was here before Anaconda popularized the idea of binary packages for Python

This has incredible "I was there, Gandalf, three thousand years ago" vibes =P

choeger

6 hours ago

The analysis contains an error. Binary artifacts don't cause exponential growth in storage requirements. It's still just linear. That's also quite clearly seen in the fact that, even after a phase of exponential growth, the binary artifacts still only account for 75%.

So this whole strategy (actually a pitch for zigbuild) would ideally reduce the storage requirements to 25% - which would buy the whole system maybe a year or two if the growth continues.

Of course, it's a good idea to build client-side. Especially considering the security implications. But it won't fundamentally change the problem.

kristoff_it

3 hours ago

The problem is not space, it's the fact that bandwidth costs 4x the total operating income that the PSF has. Everything else is context to understand and offer some insight into this problem.

This is a pitch for Zig as a C/C++ build system because, as you can see in other comments in this submission, there's still a lot of people that don't even believe this is a solvable problem at all, while in reality Zig does a pretty good job at solving it.

Only somebody actively involved with Python can really say if and how this could help improve the situation for PyPI, but if the people who actually happen to be working on something potentially beneficial weren't allowed to speak "because it's a pitch" then what even is the point of software interoperability.

As an occasional Python user, I also think that it's silly that people keep putting containers everywhere and pre-building a ton of binaries when I know for a fact that you could trivially `zig build` that shit, since I do it in my projects all the time (see https://github.com/allyourcodebase/).

chippiewill

4 hours ago

> Binary artifacts don't cause exponential growth in storage requirements. It's still just linear.

I think what the author was getting at was the combinatoric explosion as you add different variants (CPU architecture, operating systems, python versions) - but I agree that "exponential growth" is not the right term to use here.

yaleman

11 hours ago

The fact that tensorflow takes up 12.9TiB is truly horrifying, and most of it is because they use pypi's storage as a dumping ground for their pre-release packages. What a nightmare they've put on other people's shoulders.

theamk

2 hours ago

I think pypi should require larger packages, like tensorflow, to self-host their releases.

The support for that already exists: the PyPI index file contains an arbitrary URL for the data file plus a sha256 hash. Let PyPI store the hashes, so there are no shenanigans with versions being secretly overridden, but point the actual data URLs at other servers.
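Roughly, a simple-index entry for an externally hosted file would look something like this (hypothetical project and host; pip verifies the sha256 fragment after downloading):

    <a href="https://downloads.example.org/bigpkg-2.0-cp312-cp312-manylinux_2_28_x86_64.whl#sha256=0123abcd...">bigpkg-2.0-cp312-cp312-manylinux_2_28_x86_64.whl</a>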

(There must obviously be a balance for availability vs pypi's cost, so maybe pypi hosts only smaller files, and larger files must be self-hosted? Or pypi hosts "major releases" while pre-releases are self-hosted? And there should be manual exceptions for "projects with funding from huge corporations" and "super popular projects from solo developers"...)

aragilar

4 hours ago

I believe tensorflow does remove old pre-releases (I know other projects do), so that number I think might be fairly static?

That tensorflow is that big isn't surprising, given the install of it plus its dependencies is many gigabytes (you can see the compressed sizes of wheels on the release pages e.g. https://pypi.org/project/tensorflow/#files), and the "tensorflow" package (as opposed to the affiliated packages) based on https://py-code.org/stats is 965.7 GiB, which really only includes a relatively small number of pre-releases.

Why tensorflow is that big comes down to needing to support many different kinds of GPUs with different ecosystem versions, and I suspect the build time of them with zig cc (assuming it works, and doesn't instead require pulling in a different compiler/toolchain) would be so excessive (especially on IoT/weaker devices) that it would make the point of the exercise moot.

amoshebb

4 hours ago

Is it though? If it saves one engineer one afternoon that storage has paid for itself, and this thing has hundreds of thousands of downloads a day.

Wouldn’t it be more horrifying to force everybody who wants to use a prerelease to waste an afternoon getting it to build just to save half a hard drive?

skeledrew

2 hours ago

That's beside the point though. Yes, having prebuilt binaries is very helpful. But what happens if Fastly decides against renewing next time and there is nobody else willing to sponsor? The cost is sky-high, far beyond what the PSF can handle. Where does PyPI go?

pxc

an hour ago

There are already lots of passable package managers that know how to provide working binaries for the native, non-Python dependencies of Python packages. Instead of trying to make Python packages' build processes learn how to build everything else in the world, one thing Python packages could do is just record their external dependencies in a useful way. Then package managers that are actually already designed for and committed to working with multiple programming language ecosystems could handle the rest.

This is something that could be used by Nix, Guix, and Spack, as well as by more conventional software distributions like Conda, Pkgsrc, MacPorts, Homebrew, etc. With the former, users could even set up per-project environments that contain those external dependencies, like virtualenvs but much more general. The simple feature of this metadata would naturally be valuable, if provided well, to maintainers of all Linux distros and many other software distributions, where autogenerated packages are already the norm for languages like Rust and Go, while creating such tooling for Python is riddled with thorny problems. So these two proposals are not mutually exclusive, and perhaps each is individually warranted on its own.

Enriching package metadata in this simple way has already been proposed here:

https://peps.python.org/pep-0725/
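A sketch of the kind of pyproject.toml metadata it proposes (table and identifier names taken from the draft PEP, so they may still change):

    [external]
    build-requires = [
      "virtual:compiler/c",
    ]
    host-requires = [
      "pkg:generic/openssl",
      "pkg:generic/libffi",
    ]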

trlampert

4 hours ago

"When Python came into existence, repeatable builds (i.e. not yet reproducible, but at least correctly functioning on more than one machine) were a pipe dream. Building C/C++ projects reliably has been an intractable problem for a long time, but that's not true anymore."

I'd dispute that. It used to be the case that building NumPy just worked; now there are Cython/Meson and a whole lot of other dependency issues, and the build fails.

"At the Zig Software Foundation we look up to the Python Software Foundation as a great example of a fellow 501(c)(3) non-profit organization that was able to grow an incredibly vibrant community ..."

Better don't meet your heroes. Python was a reasonable community made up and driven by creative individuals. In the last 8 years, it has been taken over by corporate bureaucrats who take credit for the remnants of that community and who will destroy it. The PSF has never done anything except for selling expensive conference tickets and taking care of its own.

rightbyte

9 hours ago

These package repositories are used in a wasteful way. Probably by thousands of CI servers spinning up blank-slate Docker containers etc.

strokirk

6 hours ago

CI providers should definitely start proxying PyPI with their own cache.

Numerlor

6 hours ago

I wanted to spin up a mirror locally to do simple caching for Docker builds, but the tooling was lacking: there was a way to do a direct mirror of PyPI locally, but no other way of adding custom indices.

jeroenhd

5 hours ago

I think Sonatype Nexus [1] can do that relatively easily. I don't know if the OSS version is enough, but I think most people and projects should be fine.

[1]: https://www.sonatype.com/products/sonatype-nexus-oss-downloa...

eKIK

4 hours ago

We've used Nexus OSS just the way you describe and it worked great.

We simply set it up as a kind of "passthrough cache", so if it didn't have the package it fetched it from pypi, and stored it to be used the next time someone wanted to install the same package.

Apart from being nice to pypi, we also got a bit of a decrease in CI runtime, because it fetched packages from the local cache 99% of the time.

TheChaplain

5 hours ago

DevPi might be your answer, I think. A couple of years ago I set it up as a proxy, plus hosting my own packages locally.
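Roughly (the exact commands may differ between versions):

    pip install devpi-server devpi-client
    devpi-init     # one-time state setup (older releases used: devpi-server --init)
    devpi-server   # serves on http://localhost:3141 by default
    pip install --index-url http://localhost:3141/root/pypi/+simple/ requests

The root/pypi index is a caching proxy of PyPI, so anything not already present gets fetched once and then served from the local cache.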

Numerlor

5 hours ago

I'll take a look. I think it's something I looked at and had some issues with, but it has been a couple of years and the only thing I can remember is bandersnatch.

robertlagrant

2 hours ago

Yeah we use Artifactory internally as a proxy for DockerHub, Pypi, NPM, etc.

zamlag

5 hours ago

At work we don't use PyPI any longer. We have our own set of curated packages; the security issues are just too great:

https://developers.slashdot.org/story/24/09/15/0030229/fake-...

https://jfrog.com/blog/revival-hijack-pypi-hijack-technique-...

https://jfrog.com/blog/leaked-pypi-secret-token-revealed-in-...

We consider switching to Java, C++ or Rust because of general quality issues with Python.

notpushkin

5 hours ago

What do Java and Rust (and your C++ package manager of choice) do to mitigate those things?

theamk

an hour ago

Don't know about Java or Rust, but in C++ it's much harder to get new packages, which works wonders to keep the number of dependencies down.

In Python, installing packages is so simple that people just do it after a 10-second Google search - and that library can pull in dozens or hundreds of dependencies with usually no review.

In C++, given that you have to manually find the library and incorporate it into the build system, people generally spend a few minutes looking at the options and choosing the best one - and this includes checking things like "how long has this library been around", "does it have many users" and "is it in a healthy state". And this must be repeated for each dependency as well, so even a dozen dependencies is a huge negative - and such a library will not be used unless there are no better alternatives.

(The exception to this rule are libraries provided by your Linux distribution. Those can be easily installed by the dozen, and that's OK - the distribution makers did all the hard work for you, vetting and packaging those libraries)

This in general means a much healthier dependency state for C++, as well as much higher code quality. No one is going to add a dependency to a core library just to get better progress bars, for example.

ajrqh

5 hours ago

What is not clear about "general quality issues", i.e., issues unrelated to package management?

notpushkin

4 hours ago

OK, my bad. It is even more unclear though – do you have any particular examples of such issues?

faustin

13 hours ago

conda-forge handles the first part of this (reproducible builds) for most common platforms. The idea of rebuilding deleted artifacts on demand sounds nice in theory, but it has the complication that rebuilding something that depends on several other somethings will likely trigger a build cascade where a bunch of stuff has to get built in order. Hopefully none of those ancient build scripts require external resources hosted at dead links!

aragilar

5 hours ago

Also, this is very much assuming that the code is C or C++ and that LLVM is the right compiler to use. Fortran is still a major part of the ecosystem, which the Zig compiler isn't going to solve. There already exist numerous options to provide compilers for the problematic platforms; the fact is binary wheels (mostly) solve the issue far better than doing local builds.

Also, the large packages are typically due to the need to support the huge number of possible GPU combinations (because you care about what CUDA versions are supported).

This feels like a solution being forced on a problem (not that zig cc isn't cool), but the post has really misunderstood the issues around wheels.

AndyKelley

12 hours ago

One strategy would be making PyPI packages fetch any external resources from PyPI, or at least add PyPI URLs as mirrors for such resources.

Wowfunhappy

5 hours ago

I'm a bit confused as to why this costs so much. I thought storage was cheap?

Bandwidth is more expensive, but shouldn't be relevant to this problem. It doesn't matter whether 5 people request the same binary or 5 people request 5 different binaries for different platforms, if all the binaries are 1 gb you're transferring 5 gb of data either way.

kristoff_it

3 hours ago

Once you have control over generating the binaries, you can create a system that is more akin to what Nix offers, where other parties can set up secondary caches for their projects, and if they don't, the main registry has the option to decide what to do about it without breaking any package.

Currently PyPI is railroaded into a fairly tight set of choices because it can't exercise a lot of control over binary data.

The blog post mentions this point directly.

Wowfunhappy

3 hours ago

Thanks. The blog post alludes to mirrors but I didn't realize that was the primary goal of this project.

I am somewhat skeptical that these projects will have an incentive to actually set up mirrors. For example, I imagine Tensorflow likes getting free hosting.

kristoff_it

3 hours ago

I haven't checked, but usually when it's this visible, big tech companies tend to reciprocate so I would be surprised if Google didn't sponsor the PSF in any way.

That said, once deleting prebuilt binaries doesn't break the package anymore, if Google were to not play fair, PyPI could simply delete them. Users would experience temporary discomfort (but nothing would be irremediably broken), and soon after I'm sure that Google would decide to set up a cache for TF.

PyPI would almost certainly not even need to actually do it, the implication that it could be done would probably be enough to align everybody towards what's best for the ecosystem.

bravetraveler

4 hours ago

Really depends on how you get the storage/interface with it. Cloud/VPS is some of the worst 'bang for the buck' in my experience, where S3 or dedicated can be more favorable at a point. Not all gigglebytes cost the same :)

The article may shed some light on this, I'm not aware. Haven't read yet! I've managed a few RPM package mirrors. Dedicated gear that's a few generations old with no-name providers was the best experience.

Havoc

5 hours ago

12TB for tensorflow is absurd.

rwmj

5 hours ago

Just use the system packages! Fedora, Debian, AUR, brew/macports on macOS, etc are all a thing, use them.

BiteCode_dev

5 hours ago

There are currently 570K+ projects in pypi, 60k+ in debian repos.

It can take several months of work to approve one single package to the official repos, for a single distribution. And each have different rules and setup.

Now explain to me how you think this is going to work.

Also, do you plan to force everyone to use chroot or containers to replace their virtualenv systems to have variations on deps? Or maybe everybody should use Nix?

Do you want to do that also for JS, Ruby and PHP?

rwmj

5 hours ago

> There are currently 570K+ projects in pypi, 60k+ in debian repos.

Not all PyPi projects require C code.

> It can take several months of work to approve one single package to the official repos

This is a massive exaggeration. For Fedora it takes a couple of days, all of it being necessary review of the code and licenses. And yes, you do have to do that work, it's done by the distros themselves too.

kbolino

5 hours ago

AUR is not a source of system packages. It is a community-maintained source of package build scripts which each user must download and build themselves.

cozzyd

4 hours ago

I wish there was an easy way for pip to list what it's missing, so it's easier to install dependencies via the system package manager. Also it's sad how setuptools' bdist_rpm seems to be considered deprecated.

tempfile

4 hours ago

This is how it should be, but it isn't how it is. The system packages are generally very outdated, and don't support multiple versions on a single system (which makes dependency locking a fantasy).

I dream of the day where we only have one package manager on a single machine.

echelon

5 hours ago

Python builds weren't really hermetic before, but now your entire operating system is a part of the software definition.

This is a slide backwards.

aragilar

4 hours ago

The OS always was part of the software definition, and always will be (unless you run without an OS/write your own).

echelon

3 hours ago

How much of the host system, though? You want it to be as thin as possible. And ideally it wouldn't be a part of the definition at all.

cozzyd

4 hours ago

Plenty of python packages depend on various system packages outside of PyPI anyway.

kristoff_it

3 hours ago

Yeah, and that's another source of jankiness that could be avoided.

ForHackernews

5 hours ago

Doesn't brew compile from source every time on the user's machine?

notpushkin

5 hours ago

They do have binary packages now I think.

atemerev

4 hours ago

Nope. System packages and language dependencies are better not mixed, and the horrors if you do mix them are very real.

JohnMakin

4 hours ago

I don’t pretend to know the answer here, but python is unavoidable in my work and the packaging is a constant source of irritation for me. Disclaimer: I am not pretending to be a python expert here but coming from a C background this anecdote is baffling to me.

I am writing some Lambda OAuth glue logic using Python because Python is the best choice for this particular implementation. I need to package the "jwt" library, which worked absolutely fine for a while. Then I upgraded Python versions, and the particular pre-built container I was using to run pip and zip the packages up ended up with totally borked versions of jwt which seemed to not have functions I was previously using fine.

Dig in, finally figure out that my older version was actually importing PyJWT even though I had specified “jwt.” The new container was breaking because it was actually installing a different “jwt” library. So, the solution was to specifically specify PyJWT and which version I wanted in my pip install. Great! That’s how I think it should work and I was a little baffled that pip had made that decision for me previously.

Anyway, it now has my missing functions but is still crashing in my blue deployments. Wtf? Oh, this PyJWT import is missing an algorithm. To fix that, I also need to pip install "cryptography" (making sure to get the compatible version matrix here; at this point I had stopped trusting pip).
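A sketch of what the pinned install might look like (versions illustrative; PyJWT's crypto extra pulls in a compatible cryptography for the RSA/EC algorithms):

    pip install "PyJWT[crypto]>=2.8,<3"
    # the import name in code is still jwt:
    # import jwt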

So maybe this is all very obvious and "duh" to veterans out there, but this was impossibly silly and wasted a stupid amount of time for something that should be dead simple. Yeah, wrangling makefiles and fussing with linkers can be annoying, but I'd take that any day over this bs.

eesmith

5 hours ago

Some issues I see are:

- packages which only distribute binaries (eg, closed-source or source-for-a-fee distributions)

- it looks like Zig's C compiler does not support OpenMP, which I use

- what is the cut-off time for source vs. binary distribution? My package takes about a minute to compile (it has a lot of auto-generated source).

- what's the user impact if they have 10 projects which are just under that threshold?

- compile-time dependencies which are not recorded in pyproject.toml (like, having a Fortran compiler, having yacc/bison, etc.)

atemerev

4 hours ago

Nope. Wheels are the only thing that makes Python/PyPI usable. I don’t want to wait tens of minutes to recompile pytorch, or something (and conda is way too heavyweight for my tastes)

zzzeek

5 hours ago

My own personal TL;DR would be: PyPI has to store too much data in the form of pre-built binaries that are uploaded by package authors, so Python should adopt a repeatable build format so that PyPI itself can build wheels for any platform on demand (edit: am I misunderstanding? did they mean wheels can be built as part of the local install process?). The author is involved with some special compiler to do this.

Personally I'd love it if PyPI could build our wheels for us. That would be great; we use GitHub Actions right now, which has its own complexities, and for years we had nothing. But that would mean a huge ramp-up in processing capability for PyPI. Considering PyPI can't even handle having its packages signed and "solved" the problem by sending authors obnoxious emails if we even dare to push up a signature file, I'm not too optimistic about such a change vs. their just continuing to rely on corporate sponsors to deliver bandwidth.

robertlagrant

2 hours ago

My read was it enables both: PyPI can delete all packages that haven't been downloaded in a year or more, safe in the knowledge that they can be quickly recreated and cached, but also that more packages can be downloaded as source, and compiled at target.

Siecje

5 hours ago

Compatible releases should replace older versions.

Why would you want to install an old version of ruff?