Nixiesearch: Running Lucene over S3, and why we're building a new search engine

85 points | posted 11 hours ago
by shutty

61 Comments

jillesvangurp

6 hours ago

Both Elastic and OpenSearch have S3-based stateless versions of their search engines in the works; the Elastic one is currently available in early access. It would be interesting to see how this one improves on both approaches.

With all the licensing complexities around Elastic, more choice is not necessarily bad.

The tradeoff with using S3 is indexing latency (the time between a write being accepted and becoming visible via search) vs. easy scaling. The default refresh interval (the time the search engine waits before committing changes to an index) is 1 second. That means it takes up to 1 second before indices reflect recently added data. A common performance tweak is to increase this to 5 or more seconds. That reduces the number of writes and can improve write throughput, which is helpful when you are writing lots of data.

If you need low latency (anything where users might want to "read" their own writes), clustered approaches are more flexible. If you can afford to wait a few seconds, using S3 to store stuff becomes more feasible.

Lucene internally stores documents in segments. Segments are append-only, and there tend to be cleanup activities around rewriting and merging segments, e.g. to get rid of deleted documents or deal with fragmentation. Once segments are written, having some jobs merge them in the background isn't that hard. My guess is that with S3, the trick is to batch up some amount of writes, store them as one segment, and put that in S3.

S3 is not a proper file system, and file operations are relatively expensive (compared to a file system) because they are essentially REST API calls. So this favors use cases where you write segments in bulk and never/rarely update or delete individual things that you write, because that would require updating a segment in S3, which means deleting and rewriting it and then somehow notifying other nodes that they need to re-read that segment.
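A minimal sketch of the batching idea described above, where a plain dict stands in for an S3 bucket and JSON stands in for Lucene's segment format (both are illustrative simplifications, not how Lucene or Nixiesearch actually serialize segments):

```python
import json

class SegmentWriter:
    """Illustrative sketch: buffer writes in memory, then flush each
    batch as one immutable, append-only segment object."""

    def __init__(self, store, prefix="segments"):
        self.store = store    # stands in for an S3 bucket (key -> bytes)
        self.prefix = prefix
        self.buffer = []
        self.next_id = 0

    def add(self, doc):
        self.buffer.append(doc)

    def flush(self):
        """One PUT per batch, never one per document: this is what keeps
        the number of (expensive) object-store API calls low."""
        if not self.buffer:
            return None
        key = f"{self.prefix}/segment-{self.next_id:08d}.json"
        self.store[key] = json.dumps(self.buffer).encode()
        self.next_id += 1
        self.buffer = []
        return key

store = {}
writer = SegmentWriter(store)
for i in range(1000):
    writer.add({"id": i, "body": f"doc {i}"})
key = writer.flush()  # 1000 docs become a single object write
```

The refresh-interval tradeoff from the comment above maps directly onto how often `flush()` runs: flush rarely and you pay for few PUTs but recent writes stay invisible longer; flush often and the reverse.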

For both Elasticsearch and OpenSearch, log data and other time-series data fit this very well, because you typically don't have to deal with deletes/updates.

mdaniel

6 hours ago

> Nixiesearch uses an S3-compatible block storage (like AWS S3, Google GCS and Azure Blob Storage)

Hair-splitting: I don't believe Blob Storage is S3 compatible, so one may want to consider rewording to distinguish between whether it really, no kidding, needs "S3 compatible" or it's a euphemism for "key value blob storage"

I'm fully cognizant of the 2017 nature of this, but even they are all "use Minio" https://opensource.microsoft.com/blog/2017/11/09/s3cmd-amazo... which I guess made a lot more sense before its license change. There's also a more recent question from 2023 (by an alleged Microsoft Employee!) with a very similar "use this shim" answer: https://learn.microsoft.com/en-us/answers/questions/1183760/...

ko_pivot

5 hours ago

Azure is the only major (or even minor) cloud provider refusing to build an S3 API. Strange to me, because Azure Cosmos DB supports Mongo and Cassandra at the API level, for example, so idk what is so offensive to them about S3 becoming the standard HTTP API for object storage.

ignaloidas

5 hours ago

It's because the S3 API is quite a bit worse than what they offer. They define the guarantees for their storage products far more clearly than other clouds do, and for blob storage, from my understanding, their model is better than S3's.

oersted

8 hours ago

Check out Quickwit; it is briefly mentioned but, I think, mistakenly dismissed. They have been working on a similar concept for a few years and the results are excellent. It's in no way mainly for logs, as they claim; it is a general-purpose cloud-native search engine like the one they suggest, very well engineered.

It is based on Tantivy, a Lucene alternative in Rust. I have extensive hands-on experience with both and I highly recommend Tantivy; it's just superior in every way now, such a pleasure to use, an ideal example of what Rust was designed for.

Semaphor

7 hours ago

> It’s in no way mainly for logs as they claim

Where can I find more information on using it for user-facing search? The repository [0] starts with "Cloud-native search engine for observability (logs, traces, and soon metrics!)" and keeps talking about those.

[0]: https://github.com/quickwit-oss/quickwit

oersted

7 hours ago

That just seems to be the market where search engines have the most obvious business case, Elasticsearch positioned themselves in the same way. But both are general-purpose full-text search engines perfectly capable of any serious search use-case.

Their original breakout demo was on Common Crawl: https://common-crawl.quickwit.io/

But thanks for pointing it out, I hadn't looked at it in a few months, it looks like they significantly changed their pitch in the last year. I assume they got VC money and they need to deliver now.

AsianOtter

5 hours ago

But the demo does not work.

I tried "England is" and a few similar queries. It spends three seconds then shows that nothing is found.

oersted

5 hours ago

I tried it once and it instantly showed no results, but then I tried it again and it returned results in <1s. Just try it with a bunch of queries, I think there's caching too so it's hard to gauge performance properly.

The blog post about the demo is from 2021 and they haven't promoted it much since. I'm surprised that they even kept it online, according to the sidebar it was ~$810/month in AWS at the time.

erk__

7 hours ago

I have been using Tantivy for Garfield comic search for a few years now, it has been really nice to use in all that time.

jprd

4 hours ago

I'm simultaneously intrigued and thinking this is a funny joke. If it isn't a joke, I would love an example.

erk__

2 hours ago

Luckily it is not a joke!

It's a bot I have had running in some capacity for some years now, through a couple of rewrites. At some point Discord added "auto-complete" for commands, which means I can do a live lookup and give users a list of comics containing some piece of text.

My index is a bit out of date, but comics from before September last year can be searched.

The search index lives fully in memory, as it is not that big: only 17,363 comics. This does mean it is rebuilt on every startup, but that does not take long compared to the month-long uptime the bot usually has.

Example of a search for "funny joke": https://imgur.com/a/J4sRhPJ

Hosted bot: https://discord.com/application-directory/404364579645292564

Source code: https://git.sr.ht/~erk/lasagna

ZeroCool2u

5 hours ago

notamy

2 hours ago

Meilisearch is great when it works, but when it breaks it's a total nightmare. I've hit multiple bugs that destroyed my search index, multiple undocumented limits, ... all of which required rebuilding my index from scratch and doing a lot of work to find out what was actually going on so I could report it. It doesn't help that some of the errors it gives are incredibly non-specific, which makes it quite difficult to find what's actually breaking.

All of that said, I still use it because it has sucked less than the other search engines to run.

bomewish

7 hours ago

The big issue with tantivy I've found is that it only deals with immutable data. So it can't be used for anything you want to do CRUD on. This rules out a LOT of use cases. It's a real shame imo.

oersted

7 hours ago

It is indeed mostly designed for bulk indexing and static search, but that is not a strict limitation; frequent small inserts and updates are performant too. Deleting can be a bit awkward: you can only delete every document with a given term in a field, but if you use that on a unique id field it's just like a normal delete.

Tantivy is a low-level library to build your own search engine (Quickwit), like Lucene, it's not a search engine in itself. Kind of like how DBs are built on top of Key-Value Stores. But you can definitely build a CRUD abstraction on top of it.
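A hedged sketch of what such a CRUD abstraction over immutable segments could look like, using tombstones plus delete-by-unique-id, in plain Python (the class and its semantics are illustrative; this is not Tantivy's actual API):

```python
class AppendOnlyIndex:
    """Illustrative CRUD-ish layer over immutable segments: updates are
    re-inserts, and deletes are tombstones on a unique id field,
    mirroring the delete-by-term-on-an-id trick described above."""

    def __init__(self):
        self.segments = []       # each segment: list of (doc_id, doc)
        self.tombstones = set()  # ids deleted since their segment was written

    def write_segment(self, docs):
        """docs: {doc_id: doc}. Segments are never modified once written."""
        self.segments.append(list(docs.items()))
        for doc_id in docs:
            self.tombstones.discard(doc_id)  # a re-insert revives the id

    def delete(self, doc_id):
        self.tombstones.add(doc_id)

    def search(self, predicate):
        seen, hits = set(), []
        # scan newest segments first so re-inserted docs shadow old versions
        for segment in reversed(self.segments):
            for doc_id, doc in segment:
                if doc_id in seen or doc_id in self.tombstones:
                    continue
                seen.add(doc_id)
                if predicate(doc):
                    hits.append((doc_id, doc))
        return hits

idx = AppendOnlyIndex()
idx.write_segment({1: {"title": "old"}, 2: {"title": "keep"}})
idx.delete(1)                              # tombstone, segment untouched
idx.write_segment({2: {"title": "updated"}, 3: {"title": "new"}})
```

A background merge job (as in Lucene) would periodically rewrite old segments with tombstoned and shadowed docs physically removed; that part is omitted here.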

parhamn

an hour ago

Stateless S3 apps have much more appeal given the existence of Cloudflare R2 -- bandwidth is free and GetObject is $0.36 per million requests.

hipadev23

3 hours ago

I know object storage backends are all the rage, but this is about the most capital-intensive thing you can do on the major cloud providers. Storage and reads are cheap, but writes and list operations are insanely expensive.

Once you hook these backends up to real-time streaming updates, transactions, heavy indexing, or immutable formats that cause constant churn (Hive/Hudi/Iceberg/Delta Lake), you're in for a bad time financially.
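As a rough back-of-envelope, assuming list prices in the ballpark of S3 Standard at the time of writing (around $5 per million PUT/LIST requests and $0.40 per million GETs; these numbers are assumptions, check current pricing), per-document writes dominate the bill quickly:

```python
# Assumed ballpark prices, dollars per million requests (verify before relying on them).
PUT_PER_MILLION = 5.00
GET_PER_MILLION = 0.40

def monthly_request_cost(puts_per_sec, gets_per_sec, days=30):
    """Request-only cost for a month; storage and bandwidth excluded."""
    seconds = days * 24 * 3600
    puts = puts_per_sec * seconds
    gets = gets_per_sec * seconds
    return puts / 1e6 * PUT_PER_MILLION + gets / 1e6 * GET_PER_MILLION

# Streaming: flushing 50 small objects/sec (real-time updates, constant churn)
write_heavy = monthly_request_cost(puts_per_sec=50, gets_per_sec=50)

# Batched: one flush every 2 seconds, same read load
batched = monthly_request_cost(puts_per_sec=0.5, gets_per_sec=50)
```

With these assumed prices the write-heavy shape comes out roughly an order of magnitude more expensive per month than the batched one, despite identical read traffic, which is the churn problem the comment above is pointing at.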

gyre007

8 hours ago

It took us almost two decades, but truly cloud-native architectures are finally becoming a reality. Warp and Turbopuffer are two of the many other examples.

candiddevmike

8 hours ago

Curious what your definition of cloud native is and why you think this is a new innovation. Storing your state in a bunch of files on a shared disk is a tale as old as time.

cowsandmilk

6 hours ago

Not having to worry about the size of the disk, for one. So much time on on-premises systems was spent managing quotas for systems and users alongside the physical capacity.

warangal

6 hours ago

I have been working on a personal search engine for some time, and one problem I faced was having an effective fuzzy search over all the diverse filenames/directories. All the approaches I could find were based on Levenshtein distance, which would have required storing the original strings/text content in the index, and would neither be practical for comparing larger strings nor generic enough to handle all knowledge domains. This led me to look at locality-sensitive hashing (LSH) approaches to measure the difference between any two strings in constant time. After some work I finally managed to complete an experimental fuzzy search engine (keyword search is just a special case!).

In my analysis of 1 million Hacker News stories, it worked much better than Algolia search while running on a single core! More details are in this post: https://eagledot.xyz/malhar.md.html . I tried to submit it here to gather more feedback, but it didn't work, I guess!

iudqnolq

6 hours ago

I'm super new to this so I'm probably missing something simple, but isn't a trigram index one of the canonical solutions for fuzzy search? Eg https://www.postgresql.org/docs/current/pgtrgm.html

That often involves recording original trigram position, but I think that's necessary to weigh "I like happy cats" higher than "I like happy dogs but I don't like cats" in a search for "happy cats".
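A toy illustration of the trigram idea, using pg_trgm-style padding (the exact padding scheme is an assumption) and plain Jaccard similarity over trigram sets, without the positional weighting mentioned above:

```python
def trigrams(text):
    # Pad roughly the way pg_trgm does, so word boundaries produce
    # their own trigrams (padding details are an assumption here).
    t = f"  {text.lower()} "
    return {t[i:i + 3] for i in range(len(t) - 2)}

def score(query, doc):
    """Jaccard similarity of the two trigram sets, in [0, 1]."""
    q, d = trigrams(query), trigrams(doc)
    return len(q & d) / len(q | d)

docs = [
    "I like happy cats",
    "I like happy dogs but I don't like cats",
]
ranked = sorted(docs, key=lambda d: score("happy cats", d), reverse=True)
```

Even without positions, set-based Jaccard already prefers the short exact-phrase document, because the longer document's extra trigrams inflate the union; positional weighting sharpens this further, as the comment notes.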

warangal

5 hours ago

Yes, trigrams mainly, but bigrams and/or a combination of both are generally used to implement fuzzy search; zoekt also uses a trigram index. But such indices depend heavily on the content being indexed: if a query contains a rare trigram that was never indexed, they fail to return relevant results. LSH implementations, on the other hand, employ a more diverse collection of stats (depending on the number of buckets and the N-gram/window size used), so they compare better against unseen content/bytes at query time. It is not cheap, as each hash is around 30 bytes, often more than the string/text being indexed! But it leads to fixed-size hashes independent of the size of the content indexed, and acts as an "auxiliary" index that can be queried independently of the original index. Comparison of hashes can be optimized, leading to quite a fast fuzzy search.
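A minimal MinHash-style LSH sketch, illustrating how a fixed-size signature can stand in for the original string when estimating n-gram similarity (the parameters, hash construction, and 32-slot signature are illustrative assumptions, not the post's actual implementation):

```python
import hashlib

def ngrams(text, n=3):
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text, num_hashes=32, n=3):
    """Fixed-size signature regardless of input length: for each of
    num_hashes seeded hash functions, keep the minimum hash value over
    all n-grams. Similar strings share many of these minimum values."""
    grams = ngrams(text, n)
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # distinct salt = distinct hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for g in grams))
    return sig

def similarity(sig_a, sig_b):
    # Fraction of matching slots estimates the Jaccard similarity
    # of the underlying n-gram sets, in constant time per comparison.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing two signatures touches only `num_hashes` integers, independent of the original string lengths, which is the constant-time comparison property the comment describes.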

mikeocool

8 hours ago

I love all of the software coming out recently backed by simple object storage.

As someone who spent the last decade and half getting alerts from RDBMSes I’m basically to the point that if you think your system requires more than object storage for state management, I don’t want to be involved.

My last company looked at rolling out elastic/open search to alleviate certain loads from our db, but it became clear it was just going to be a second monstrously complicated system that was going to require a lot of care and feeding, and we were probably better off spending the time trying to squeeze some additional performance out of our DB.

spaceribs

8 hours ago

This is very much the Unix philosophy, right? Everything is a file? [1]

[1]https://en.wikipedia.org/wiki/Everything_is_a_file

pjc50

6 hours ago

Not quite - "everything is a blob" has very different concurrency semantics to "everything is a POSIX file". You can't write into the middle of a blob, for example. This makes certain use cases harder but the concurrency of blobs is much easier to reason about and get right.

Personally I think you might actually need a DB to do the work of a DB, and you can't as easily build one on top of a blob store as on a block device. But I do think most distributed systems should use blob and/or DB and not the filesystem.
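The contrast between the two models can be made concrete in a few lines; here `io.BytesIO` stands in for a POSIX file and a dict for a blob store, and the point is that the blob update has to replace the whole object:

```python
import io

# "Everything is a POSIX file": seek and patch 5 bytes in place.
f = io.BytesIO(b"hello world")
f.seek(6)
f.write(b"earth")
posix_result = f.getvalue()

# "Everything is a blob": no partial-write API, so an update is a
# read of the whole object, a modification, and a full-object PUT.
store = {"greeting": b"hello world"}
data = bytearray(store["greeting"])   # GET
data[6:11] = b"earth"
store["greeting"] = bytes(data)       # full-object PUT
```

Concurrency is where the models really diverge: two concurrent blob PUTs resolve to one whole, consistent object winning, whereas two concurrent in-place file writes can interleave at byte granularity, which is the "easier to reason about" property the comment refers to.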

remram

6 hours ago

On the other hand, the S3-compatible server options are quite limited. While you're not locking yourself to one cloud, you are locking yourself to the cloud.

mikeocool

4 hours ago

At this point in my career, I've found that paying to make something hard someone else's problem is often well worth it.

candiddevmike

8 hours ago

Why would you prefer state management in object storage vs a relational (or document) database?

orthecreedence

3 hours ago

Two main reasons I can see:

Ops is easier, for the most part. Doing ops on an RDBMS correctly can be a pain: things like replication, failover, and performance tuning can be hard. This is much less of an issue nowadays, though, because services like RDS have solved it for a long time.

Splitting compute from storage makes scaling a lot easier, especially when storage is an object store where you don't have to worry about RAID, disk backups, etc. For clustered systems like Elasticsearch in particular, having an object-store backing would be incredible: if you need to spin up a new server, instead of starting it, convincing it to download the portions of the indexes it's supposed to hold, and waiting for everything to transfer, you just start it and let it run immediately. You can also now run 80% spot instances for your compute nodes, because if one gets recalled, the replacement doesn't have to sync its state from the other servers; it can just go about business as usual. And a sudden loss of 60% of your nodes doesn't mean data loss the way it does when your nodes hold all the state.

I think for something like an RDBMS, object-store backing is very likely completely overkill, unless you're hitting some scaling threshold that most of us don't deal with ever. For clustered DB systems (cassandra/scylla, ES, etc etc), splitting out storage makes cluster management, scalability, and resiliency worlds easier.

mikeocool

8 hours ago

So many fewer moving parts to manage/break.

mhitza

8 hours ago

I used offline indexing with Solr back in 2010-2012, because the latency between the Solr server and the MySQL DB (indexing done via the dataimport handler) was causing the indexer to take hours instead of under an hour (same server vs. servers in the same datacenter).

In many ways Solr has come a long way since, and I'm curious to see how well they can make a similar system perform in the cloud environment.

whalesalad

6 hours ago

I recently got back into search after not touching ES since like 2012-2013. I forgot how much of a fucking nightmare it is to work with and query. Love to see innovation in this space.

staticautomatic

6 hours ago

I feel like it’s not that bad to interact with if you do it regularly, but if I go a while without using it I forget how to do everything. I sure as hell wouldn’t want to admin an instance.

huntaub

4 hours ago

This is a super cool project, and I think that we will continue to see more and more applications move towards an "on S3" stateless architecture. That's part of the reason why we are building Regatta [1]. We are trying to enable folks who are running software that needs file system semantics (like Lucene) to get super-fast NVMe-like latencies on data that's really in S3. While this is awesome, I worry about all of the applications that won't have someone rewrite a bunch of layers to work on S3. That's where we come in.

[1] https://regattastorage.com

marginalia_nu

8 hours ago

This would have been a lot easier to read without all the memes and attempts to inject humor into the writing. It's frustrating because it's an otherwise interesting topic :-/

prmoustache

8 hours ago

How hard is it to just jump past them?

Answer: it is not.

infecto

8 hours ago

It generally is a major distraction from the content and feels like a pattern from a decade+ ago when technical blog posts became the hot thing to do.

You can certainly jump over it but I imagine a number of people like myself just skip the article entirely.

vundercind

4 hours ago

I like the style, but this case felt forced. Like when corporate tries to do memes.

mannyv

6 hours ago

I forgot that a reindex on Solr/Lucene blows away the index. Now I remember what a nightmare that was, because you couldn't find anything until it was done, which usually took a few hours when things were HDD-based.

Just started a search project, and this one will be on the list for sure.

manx

8 hours ago

I thought about creating a search engine using https://github.com/phiresky/sql.js-httpvfs, commoncrawl and cloudflare R2. But never found the time to start...

mallets

8 hours ago

Many things seem feasible with competitive object storage pricing. It still needs a little bit of local caching to reduce read requests and origin abuse.

I think rclone mount can do the same thing with its chunked reads + cache; I wonder what the memory overhead for the process is.

ko_pivot

8 hours ago

I’m a fan of all these projects that are leveraging S3 to implement high availability / high scalability for traditionally sensitive stateful workloads.

Local caching is a key element of such architectures, otherwise S3 is too slow and expensive to query.

candiddevmike

8 hours ago

The write speed is going to be horrendous IME, and how do you handle performant indexing...

ctxcode

5 hours ago

Sounds like this is going to cost a lot of money (more than it should).

stroupwaffle

7 hours ago

There’s no such thing as stateless, and there’s no such thing as serverless.

The universe is a stateful organism in constant flux.

Put another way: brushing-it-under-the-rug as a service.

zdragnar

7 hours ago

There is no spoon.

Put it another way: serverless and stateless don't mean what you think they mean.

MeteorMarc

6 hours ago

I feel clueless

stroupwaffle

6 hours ago

It’s not the spoon that bends, it’s the world around it.

ctxcode

6 hours ago

Serverless just means that a hosting company routes your domain to one or more servers that the hosting company owns and puts your code on, and that it can spin up more or fewer servers based on traffic. TL;DR: serverless uses many, many servers, just none that you own.

zdragnar

4 hours ago

More specifically: no instances that you maintain or manage. You don't care which machine your code runs on, or even whether all your code is on the same machine.

Compute availability is lumped into one gigantic pool, and all of the concerns below the execution of your code are managed for you.