I scanned 2,500 Hugging Face models for malware/issues. Here is the data

24 points, posted 17 days ago
by arseniibr

21 Comments

embedding-shape

13 days ago

> Broken files — 16 models were actually Git LFS text pointers (several hundred bytes), not binaries. If you try to load them, your code crashes.

Yeah, if you don't know how to use the repositories, they might look broken :) Pointers are fine: the blobs are downloaded after you fetch the git repository itself, and then it's perfectly loadable. Seems like a really basic thing to misunderstand, given the context.

Please, understand how things typically work in the ecosystem before claiming something is broken.

That whatever LLM you used couldn't import some specific libraries also doesn't mean the repository itself has issues.

I think you need to go back to the drawing board here, fully understand how things work, before you set out to analyze what's "broken".

arseniibr

13 days ago

In an ideal local environment with a properly configured git client, sure. But in real-world CI/CD pipelines, people can use wget, curl, or custom caching layers that often pull the raw pointer file instead of the LFS blob. When that hits torch.load() in production, the service crashes. The tool was designed to catch this integrity mismatch before deployment.
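For illustration, a minimal sketch of the kind of integrity check I mean (not the tool's actual code, just the idea; the pointer signature is from the Git LFS spec):

    import sys
    from pathlib import Path

    # A Git LFS pointer is a tiny text file that starts with this line.
    LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"

    def is_lfs_pointer(path: str) -> bool:
        """True if the file is an un-fetched LFS pointer, not the real blob."""
        p = Path(path)
        # Real weights are megabytes to gigabytes; pointers are ~130 bytes.
        if p.stat().st_size > 1024:
            return False
        with p.open("rb") as f:
            return f.read(len(LFS_SIGNATURE)) == LFS_SIGNATURE

    if __name__ == "__main__":
        if is_lfs_pointer(sys.argv[1]):
            sys.exit(f"{sys.argv[1]}: Git LFS pointer, run `git lfs pull` first")

Cheap to run in CI, and it fails before torch.load() ever sees the file.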

embedding-shape

12 days ago

Right, but if your CI/CD pipeline is fetching repositories that are using Git LFS while whatever pipeline you're creating/maintaining can't actually handle Git LFS, wouldn't you say that it's the pipeline that would have to be fixed?

Trying to patch your CI builds by adding a tool that scans for licenses, "malware", and other metadata errors on top of all of this feels very much like "the wrong solution"; fix the issue at the root instead: the pipeline is doing the wrong thing.

arseniibr

11 days ago

I agree that fixing the pipeline is the correct decision, but I created this tool to provide the detection.

In a complex environment, you often don't control the upstream ingestion methods used by every team. They might use git lfs, wget, huggingface-cli, or custom caching layers.

Relying solely on the hope that every downstream consumer correctly handles Git LFS is dangerous. This tool acts as a detector to catch those inevitable human or tooling errors before they crash production.

embedding-shape

10 days ago

> This tool acts as a detector to catch those inevitable human or tooling errors before they crash production.

Again, that sounds like a bigger issue: that a repository using Git LFS can somehow "crash production" is where I'd add resilience first. But as mentioned in another comment, I don't have the full view of your infrastructure; maybe it has to work like that for whatever reason, so YMMV.

wbshaw

13 days ago

Calling them broken files might not be correct. However, I can see how, if you are not diligent about watching commits to those git repos, you could end up with a Trojan horse that introduces a vulnerability after you've vetted the model.

embedding-shape

13 days ago

Well, sure, but how does this tool help in any way with that? If you're using Git LFS, the tool just says it's broken rather than actually pulling down the blobs and checking those. It wouldn't prevent "malicious weights".

Besides, pickle is the data format that introduces the possibility of vulnerabilities; if the model weights are in .safetensors, you're safe regardless.

lucrbvi

13 days ago

You should know that there is already a solution for this: SafeTensors [0].

But it may be a nice tool for those who download "unsafe" models.

[0]: https://huggingface.co/docs/safetensors/index

arseniibr

13 days ago

Safetensors is the goal, but legacy models are still there. A massive portion of the ecosystem (especially older fine-tunes and specialized architectures) is still stuck on Pickle/PyTorch .bin. Until 100% of models migrate, we need tooling to audit the "unsafe" ones.

patrakov

17 days ago

The single --force flag is not a good design decision. Please break it up (EDIT: I see you already did it partially in veritensor.yaml). Right now, according to the description, it suppresses detection of both genuinely non-commercial/AGPL models and models with inconsistent licensing data. Also, I might accept AGPL but not CC-BY-NC.

Probably, it would be better to split it into --accept-model-license=AGPL --accept-inconsistent-licensing --ignore-layer-license-metadata --ignore-rce-vector=os.system and so on.

arseniibr

17 days ago

Thank you for the valuable feedback. I agree that having granular CLI flags is better for ad-hoc scans or CI pipelines where you don't want to commit a config file. Splitting it into --ignore-license vs --ignore-malware (which should probably never be ignored easily) is a great design decision. Added to the roadmap!
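Sketching what that could look like (none of these flags exist yet; the names are taken from the suggestion above, so treat this as a proposal, not the tool's interface):

    import argparse

    # Hypothetical granular flags mirroring the parent comment's suggestion.
    parser = argparse.ArgumentParser(prog="veritensor")
    parser.add_argument("--accept-model-license", action="append", default=[],
                        metavar="SPDX_ID",
                        help="explicitly allow a license, e.g. AGPL-3.0 (repeatable)")
    parser.add_argument("--accept-inconsistent-licensing", action="store_true",
                        help="don't fail when README and metadata licenses differ")
    parser.add_argument("--ignore-rce-vector", action="append", default=[],
                        metavar="CALL",
                        help="suppress one specific finding, e.g. os.system")
    args = parser.parse_args()

Repeatable allow-lists make the accepted risk explicit in the CI invocation itself, instead of one blanket --force.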

amelius

13 days ago

> loading them with torch.load() can lead to RCE (remote code execution)

Why didn't the Torch team fix this?

embedding-shape

13 days ago

OP misunderstands: the issue is specifically with the pickle format and similar ones, as they're essentially code that needs to be executed, not just data to be loaded. Most of the ecosystem has already moved to the .safetensors format, which is just data and doesn't suffer from that issue.
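Concretely, either of these avoids arbitrary code execution (both are real, current APIs; the file names are placeholders):

    # Option 1: pure-data format, no code execution by construction.
    from safetensors.torch import load_file
    state_dict = load_file("model.safetensors")

    # Option 2: restrict the unpickler to plain tensors/containers.
    # (weights_only=True is the default since PyTorch 2.6.)
    import torch
    state_dict = torch.load("model.bin", weights_only=True)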

arseniibr

13 days ago

Safetensors solves RCE, but it doesn't solve legal liability. I scan .safetensors because metadata headers often contain restrictive licenses (like CC-BY-NC) that contradict the repo's README. Deploying a non-commercial model in a commercial SaaS is a security/compliance incident even if no code is executed (PS: I'm in the EU, and this matters for us).

Additionally, a massive portion of the ecosystem is still stuck on Pickle/PyTorch .bin.
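For context, that license check needs nothing more than the file header. A minimal sketch (assuming a "license" key in __metadata__, which is a convention some exporters use, not a guarantee; the substring check is deliberately crude):

    import json
    import struct

    def safetensors_metadata(path: str) -> dict:
        """Read only the JSON header of a .safetensors file (no tensor data)."""
        with open(path, "rb") as f:
            (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
            header = json.loads(f.read(header_len))
        return header.get("__metadata__", {})

    meta = safetensors_metadata("model.safetensors")
    license_tag = meta.get("license", "")
    if "nc" in license_tag.lower():  # e.g. cc-by-nc-4.0, openrail-nc
        print(f"warning: non-commercial license in header: {license_tag}")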

embedding-shape

12 days ago

Right, but in these environments (PS: I'm also in the EU, and I also work in the ecosystem) we don't just deploy 3rd-party data willy-nilly; you take some sort of ownership of the data, review+polish, and then you deploy that. Since security and compliance are important for you, I'm assuming you're doing the same?

And when you're doing that, you have plenty of opportunity to turn pickle into whatever format you want, since you're holding and owning the data anyway.

arseniibr

11 days ago

Don't you suppose that in a large company with teams of 50+ devs/DS pulling models for experiments, enforcing a manual "review+polish+convert" workflow for every single artifact can create a massive bottleneck and, as a result, shadow IT? Doesn't it make sense to automate the "review" part?

embedding-shape

10 days ago

If you run teams with 50+ devs, then you MUST ensure the pipelines actually work for every single project they work on; you don't PATCH validation on top of what already seems to be brittle infrastructure.

But I don't manage the infrastructure where you work, and I don't have the full picture. It sounds to me like there is a different issue going on; the issue isn't "some HF repos use Git LFS, so we need a tool to flag those".

arseniibr

13 days ago

PyTorch relies on Python's pickle module for serialization, whose format is essentially bytecode for a stack-based virtual machine. This allows for saving arbitrary Python objects, custom classes, etc., but the trade-off is security. The PyTorch docs explicitly say: "Only load data you trust."

"torch.load() unless weights_only parameter is set to True, uses pickle module implicitly, which is known to be insecure. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never load data that could have come from an untrusted source in an unsafe mode, or that could have been tampered with. Only load data you trust. — PyTorch Docs"

In the real world, some people might download weights from third-party sources. Since PyTorch doesn't sandbox the loading process, I built the tool to inspect the pickle bytecode before execution.
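To illustrate what "inspecting the bytecode" means, a simplified sketch using only the standard library (the real scanner handles more cases; note that a PyTorch .pt/.bin is a zip archive, so you'd scan the embedded data.pkl member, not the raw file):

    import pickletools

    DANGEROUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins"}

    def scan_pickle(data: bytes) -> list:
        """Statically walk the opcode stream; nothing is ever unpickled."""
        findings = []
        recent_strings = []  # STACK_GLOBAL takes module/name from the stack
        for opcode, arg, pos in pickletools.genops(data):
            if isinstance(arg, str):
                recent_strings = (recent_strings + [arg])[-2:]
            if opcode.name == "GLOBAL":  # arg is "module name"
                if arg.split(" ", 1)[0] in DANGEROUS_MODULES:
                    findings.append(f"byte {pos}: GLOBAL {arg}")
            elif opcode.name == "STACK_GLOBAL" and len(recent_strings) == 2:
                module, name = recent_strings
                if module in DANGEROUS_MODULES:
                    findings.append(f"byte {pos}: STACK_GLOBAL {module}.{name}")
        return findings

    # Usage: scan_pickle(open("data.pkl", "rb").read())

A payload has to import something like os.system via a GLOBAL or STACK_GLOBAL opcode before it can run it, which is what makes this kind of static pass useful.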
