jamiesonbecker
8 hours ago
These questions are meant to be constructively critical, but not hyper-critical: I'm genuinely interested and a big fan of open-source projects in this space:
* In terms of a high-performance AI-focused S3 competitor, how does this compare to NVIDIA's AIstore? https://aistore.nvidia.com/
* What's the clustering story? Is it complex like Ceph, does it require K8s like AIstore for full functionality, or is it more flexible like Garage, MinIO, etc.?
* You spend a lot of time talking about performance; do you have any benchmarks?
* Obviously most of the page was written by ChatGPT: what percentage of the code was written by AI, and has it been reviewed by a human?
* How does the object storage itself work? How is it architected? Do you DHT, for example? What tradeoffs are there (CAP, for example) vs the 1.4 gazillion alternatives?
* Are there any front-end or admin tools (and screenshots)?
* Can a cluster scale horizontally or only vertically (i.e., MinIO)?
* Why not instead just fork a previous version of Minio and then put a high-speed metadata layer on top?
* Is there any telemetry?
* Although it doesn't matter as much for my use case as for others, what is the specific jurisdiction of origin?
* Is there a CLA and does that CLA involve assigning rights like copyright (helps prevent the 'rug-pull' closing-source scenario)?
* Is there a non-profit Foundation, goal for CNCF sponsorship or other trusted third-party to ensure that the software remains open source (although forks of prior versions mostly mitigates that concern)?
Thanks!
mrweasel
8 hours ago
> the page was written by ChatGPT
I wonder if that's why it's all over the place. Meta engine written in Zig, okay, do I need to care? Gateway in Rust... probably a smart choice, but why do I need to be able to pick between web frameworks?
> Most object stores use LSM-trees (good for writes, variable read latency) or B+ trees (predictable reads, write amplification). We chose a radix tree because it naturally mirrors a filesystem hierarchy
Okay, so are radix trees good for writes and reads, bad for both, or somewhere in between?
What is "physiological logging"?
randallsquared
8 hours ago
A hybrid of physical logging, which is recording page-by-page byte changes, and logical logging, which is recording the activity performed at an intent level. A physiological record is physical to a page but logical within it, e.g. "insert key K into page P" rather than a raw byte diff. I imagine it was first conceived of as "physio-logical".
I could only find references to this in database systems course notes, which may indicate something.
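A minimal sketch contrasting the three styles (illustrative types only, not any real engine's log format):

```rust
// Toy illustration of physical vs. logical vs. physiological log records,
// as they are usually contrasted in database systems texts.
#[derive(Debug)]
enum LogRecord {
    // Physical: raw before/after bytes at an offset within a page.
    Physical {
        page_id: u64,
        offset: u16,
        before: Vec<u8>,
        after: Vec<u8>,
    },
    // Logical: the high-level intent, independent of physical layout.
    Logical { op: String }, // e.g. "PUT bucket/key -> blob 42"
    // Physiological: physical to the page, logical about what changed inside it.
    Physiological { page_id: u64, op: String }, // e.g. "insert key K in page 7"
}

fn main() {
    let rec = LogRecord::Physiological {
        page_id: 7,
        op: "insert key 'a/b' -> v1".into(),
    };
    // Redo only needs to re-apply the op to that one page; no byte diff is stored.
    if let LogRecord::Physiological { page_id, op } = &rec {
        println!("redo on page {page_id}: {op}");
    }
}
```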
mirekrusin
7 hours ago
But this FractalBits thingy is not open source.
Their internals (in Zig, the actually interesting functional part) are proprietary.
What is open source is the Rust client side for accessing it.
Did I get that right?
fractalbits
5 hours ago
Yes. Zig is moving to a new async IO model, and we'd like to wait for it to be more stable before open-sourcing our metadata engine. You can check the underlying runtime we are using for now: https://codeberg.org/thomas-fractalbits/iofthetiger
fractalbits
5 hours ago
Thanks for the thoughtful and thorough questions! Really appreciate the constructive engagement. Let me address each one:
> NVIDIA's AIstore
AFAIK, AIstore is designed as a caching/proxy layer, whereas FractalBits is a ground-up object storage implementation with a custom metadata engine (Fractal ART, an on-disk adaptive radix tree). AIstore also seems to focus more on the training data pipeline (prefetch, shuffle, ETL), while we focus on storage-layer performance itself. One thing I am aware of is the (folder) rename operation, which would be a little tricky for a caching/proxy layer. I'd like to do more research on a detailed comparison and update our website. Thanks for mentioning this.
> What's the clustering story? Is it complex like ceph, requires K8s like AIstore for full functionality, or is it more flexible like Garage, Minio, etc?
You can check our arch doc [1] for the clustering. Right now we are focusing on the cloud (AWS) environment: you can simply run `just deploy create-vpc`, which calls CDK underneath to set up all the clustering. You can also run lower-level CDK commands for customization. We are also working on a K8s deployment for on-prem environments.
> You spend a lot of time talking about performance; do you have any benchmarks?
Yes, published in the README [2] with a reproducible setup:
- GET: 982K IOPS, 3.8 GiB/s throughput, 2.9ms avg latency, 5.3ms p99
- PUT: 248K IOPS, 970 MiB/s throughput, 6.6ms avg latency
Configuration: 4KB objects, 14x c8g.xlarge (API), 6x i8g.2xlarge (BSS), 1x m7gd.4xlarge (NSS), 3-way data replication. Cost: ~$8/hour on-demand. You can reproduce this yourself with `just deploy create-vpc --template perf_demo --with-bench` and run the included benchmark scripts.
> Obviously most of the page was written by ChatGPT: what percentage of the code was written by AI, and has it been reviewed by a human?
Yeah, some blog content was generated by Gemini. For code, I hadn't used AI until this September, as you can check from the git commit history. The core Fractal ART engine and the io_uring integration are hand-crafted for performance.
I am also learning to work with AI (spec-driven). One thing I've noticed is that AI made performance-related experiments much more efficient than before (framework switch from axum to actix-web, io_uring-based RPC, a Rust arena allocator, etc.). I use my customized nvim setup to review every line of code written by AI.
> How does the object storage itself work? How is it architected? Do you DHT, for example? What tradeoffs are there (CAP, for example) vs the 1.4 gazillion alternatives?
Architecture [1]: Not DHT-based. We use a centralized metadata service (NSS) with a custom on-disk Adaptive Radix Tree called Fractal ART. Key design choices:
- Full-path naming: avoids distributed transactions for directory operations
- Quorum replication: N/R/W configurable (default 3-way data, 6-way metadata)
- Two-tier storage: objects <1MB on local NVMe, larger objects tiered to an S3 backend
CAP tradeoffs: We prioritize CP (consistency + partition tolerance). Strong consistency via quorum writes. HA for NSS is under testing.
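To make the full-path-keyed radix tree idea concrete, here's a toy prefix tree over object paths. It shows why a path-keyed tree makes "directory" listing a single prefix walk (the real Fractal ART is an adaptive radix tree with variable node sizes and on-disk layout; this sketch uses one map per node purely for brevity):

```rust
use std::collections::BTreeMap;

// Toy radix (prefix) tree keyed on full object paths.
#[derive(Default)]
struct Node {
    children: BTreeMap<u8, Node>,
    value: Option<u64>, // e.g. a blob id
}

impl Node {
    fn insert(&mut self, key: &[u8], value: u64) {
        let mut n = self;
        for &b in key {
            n = n.children.entry(b).or_default();
        }
        n.value = Some(value);
    }

    // "Directory" listing: descend to the prefix node, then
    // collect every key in that subtree only.
    fn list_prefix(&self, prefix: &[u8]) -> Vec<Vec<u8>> {
        let mut n = self;
        for &b in prefix {
            match n.children.get(&b) {
                Some(c) => n = c,
                None => return vec![],
            }
        }
        let mut out = vec![];
        n.collect(prefix.to_vec(), &mut out);
        out
    }

    fn collect(&self, path: Vec<u8>, out: &mut Vec<Vec<u8>>) {
        if self.value.is_some() {
            out.push(path.clone());
        }
        for (&b, c) in &self.children {
            let mut p = path.clone();
            p.push(b);
            c.collect(p, out);
        }
    }
}

fn main() {
    let mut root = Node::default();
    root.insert(b"photos/2024/a.jpg", 1);
    root.insert(b"photos/2024/b.jpg", 2);
    root.insert(b"logs/x.txt", 3);
    // Listing "photos/2024/" touches only that subtree.
    assert_eq!(root.list_prefix(b"photos/2024/").len(), 2);
}
```

This is also why a folder rename is tractable for a path-keyed tree (re-root a subtree) but awkward for a flat hash/DHT layout.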
> Are there any front-end or admin tools (and screenshots)?Admin UI: Yes, we have a web-based UI (React/Ant Design) for bucket browsing, object management, and access key configuration. Currently basic but functional. Screenshots aren't published yet, but it's bundled with deployments. CLI tooling via standard AWS CLI works out of the box since we're S3-compatible.
> Can a cluster scale horizontally or only vertically (ie Minio)
Scaling: Horizontal for data nodes (BSS): add more instances, and data distributes across them via quorum over data volumes and metadata volumes. API servers are stateless and horizontally scalable. NSS (metadata) is currently single-node but designed for future horizontal sharding (split/fractal); on the roadmap: "support hundreds of NSS and BSS instances". This differs from MinIO's per-server erasure sets: we use proper quorum-based distribution.
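The quorum idea above in a nutshell (illustrative numbers matching the default 3-way replication, not the actual wire protocol):

```rust
// Toy quorum check: with N = 3 replicas, a write commits once W acks
// arrive, and choosing R + W > N guarantees every read quorum overlaps
// the latest write quorum.
fn quorum_write(acks: &[bool], w: usize) -> bool {
    acks.iter().filter(|&&a| a).count() >= w
}

fn main() {
    let n = 3;
    let (w, r) = (2, 2);
    assert!(r + w > n, "read/write quorums must overlap");
    // One replica down: the write still commits.
    assert!(quorum_write(&[true, true, false], w));
    // Two replicas down: the write must fail.
    assert!(!quorum_write(&[true, false, false], w));
}
```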
> Why not instead just fork a previous version of Minio and then put a high-speed metadata layer on top?
Two reasons: 1. License: MinIO switched to AGPL, then to a more restrictive license. Apache 2.0 was important to us. 2. Architecture: We have a fundamentally different architecture to solve MinIO's inherent scalability issues [3]. Bolting this onto MinIO would amount to a rewrite anyway.
> Is there any telemetry?
Yes, but it's not fully polished (the CloudWatch setup is currently commented out). We use Rust crates for tracing & metrics. We are currently working on distributed tracing capabilities, and we have embedded a trace_id in all our RPC headers. Once it stabilizes, we'll update the docs.
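Propagating a trace_id through RPC headers looks roughly like this (a sketch with plain types; the header name `x-trace-id` is my placeholder, not necessarily what the project uses, and the real code would go through its tracing/metrics crates):

```rust
use std::collections::HashMap;

// Attach a trace id to an outgoing RPC's headers so every hop
// (API -> NSS -> BSS) can be correlated in distributed traces.
fn attach_trace_id(headers: &mut HashMap<String, String>, trace_id: &str) {
    headers.insert("x-trace-id".to_string(), trace_id.to_string());
}

fn main() {
    let mut headers = HashMap::new();
    attach_trace_id(&mut headers, "req-7f3a");
    assert_eq!(
        headers.get("x-trace-id").map(String::as_str),
        Some("req-7f3a")
    );
}
```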
> Although it doesn't matter as much for my use case as for others, what is the specific jurisdiction of origin?
The company (FractalBits Labs) is incorporated in the United States. The BYOC model means your data stays in your cloud account in your chosen AWS region - we never see your data.
> Is there a CLA and does that CLA involve assigning rights like copyright (helps prevent the 'rug-pull' closing-source scenario)?
CLA: Currently no CLA - contributions are under Apache 2.0 via the standard "inbound = outbound" model. We don't require copyright assignment. This makes a rug-pull harder since we can't relicense without contributor consent. Happy to discuss a more formal CLA if the community prefers one.
> Is there a non-profit Foundation, goal for CNCF sponsorship or other trusted third-party to ensure that the software remains open source (although forks of prior versions mostly mitigates that concern)?
Foundation/CNCF: Not yet, but it's something we're open to as the project matures (Apache/CNCF/LF AI & Data Foundation). For now, the Apache 2.0 license provides baseline protection: any prior version remains open source regardless of future decisions. We'd welcome community input on governance structure as adoption grows. On the other hand, I (with a partner) have been working on this full-time for a whole year, and would also welcome conversations with anyone interested in supporting the project's growth.
[1] https://github.com/fractalbits-labs/fractalbits-main/blob/ma... [2] https://github.com/fractalbits-labs/fractalbits-main/blob/ma... [3] https://github.com/minio/minio/issues/7986
throwaway894345
8 hours ago
I’m also curious about the Kubernetes story—specifically how can one run this in Kubernetes?