joeg_usa
11 hours ago
I built a search engine that runs on Node + SQLite + FTS5.
- BM25 + 384-dim vector + FTS5 hybrid ranking
- Mesh network with RSA crypto identity (no central auth)
- Remote nodes contribute crawl data through P2P WebSocket
- 930 bytes per doc (2M docs = ~2GB)
- Currently indexing 52K+ domains
- Runs on 2 servers for $22/month
- Patent pending
Why: I wanted search infrastructure anyone could own and run. No Elasticsearch cluster. No cloud dependency. No vendor lock-in.

Demo: https://www.qwikwit.com

Stack: Node, JavaScript, SQLite, FTS5, WebSocket mesh

Happy to answer questions about the architecture.
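To make the hybrid ranking idea concrete, here is a minimal sketch of how an FTS5 BM25 rank could be blended with vector similarity. This is an illustrative assumption, not the actual qwikwit formula: the `alpha` weight and the normalization are made up, and `cosineSimilarity`/`hybridScore` are hypothetical helper names.

```javascript
// Hybrid ranking sketch: blend an FTS5 BM25 rank with cosine similarity
// between a query embedding and a document embedding (e.g. 384-dim).
// SQLite's bm25() returns lower-is-better values, so we negate it
// before mixing. The 0.7 lexical weight is an illustrative assumption.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// bm25Rank: raw value from SQLite's bm25() (lower = better match).
// queryVec / docVec: embedding vectors of equal length.
function hybridScore(bm25Rank, queryVec, docVec, alpha = 0.7) {
  const lexical = -bm25Rank; // flip sign so higher = better
  const semantic = cosineSimilarity(queryVec, docVec);
  return alpha * lexical + (1 - alpha) * semantic;
}
```

In practice the BM25 and cosine scores live on different scales, so a real implementation would normalize each per result set before mixing.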
n1xis10t
9 hours ago
Cool! Does ranking use term proximity of any kind, or anything similar to pagerank?
I only get 45 results for cheese. I should get far more results for cheese from an index of 2 million pages. How many pages are indexed? Is it actually 52K pages? Is the demo smaller on purpose?
What is your long term goal, to develop this into a general web search tool, or to have this be an Elasticsearch competitor that a bunch of people use for their own data?
Also since this is about search engines, I need to share this interesting article: https://archive.org/details/search-timeline
joeg_usa
3 hours ago
The index is building toward 2M. At present around 2.8 million entries are in the crawl pipeline, but only ~52k are indexed and searchable...
The long-term goal is to provide a local search solution for your own data that competes without the complexity, cost, and distribution overhead.
And yes, site search was the original feature set, which you aptly identified.
joeg_usa
3 hours ago
I suppose the answer here is that there are around 52k results, not by design but by limitation. We are actively crawling toward 2M with 10 local crawlers, 5 remote crawlers, and 0 mesh crawlers.
In general the target set is huge, but the sample indexed so far is small (and growing). My goal here was to showcase the site as an example.
Index Progress: 2.7% | 53,507 / 2,000,000 docs
Avg CPU / Memory: 2.5% / 956MB
Avg Doc Size: 1KB
Users (Active): 3 (3)
Active Sessions: 0
Local Crawlers: 11/11
Remote Crawlers: 4/5
Index Rate: 187/min
Total Documents: 53,507
Active Domains: 2,341
joeg_usa
3 hours ago
Ranking Algorithm:
The search uses SQLite FTS5 with BM25 ranking (Okapi BM25 probabilistic retrieval model), not PageRank. Current ranking factors:
- Field boosts: Title gets 20x weight, Tags 5x, Content 1x
- Quality boosts: +2 for meta descriptions, +3 for well-structured content (200-10k chars)
- No term proximity currently - FTS5 does boolean matching but not phrase distance scoring (though it is easy enough to do if necessary).
- No link graph/PageRank - we don't analyze inbound links between pages
Term proximity and link-based authority scoring (like our WitRank domain scoring system) are potential future enhancements. WitRank is built and produces scores, but they are not used in ranking yet.
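Those field and quality boosts can be sketched roughly as follows. This is a hypothetical helper, not the actual qwikwit code: in FTS5 itself the 20x/5x/1x field weights would normally be passed as column arguments to `bm25()`, and only the quality boosts would be applied afterward in application code.

```javascript
// Sketch of the boosts described above, applied on top of a base
// relevance score. In FTS5 the field weights would be supplied as
// column weights in the query, e.g.:
//   SELECT ..., bm25(docs, 20.0, 5.0, 1.0) AS rank FROM docs
//   WHERE docs MATCH ? ORDER BY rank
// (columns ordered title, tags, content).
function applyQualityBoosts(baseScore, doc) {
  let score = baseScore;
  if (doc.metaDescription) score += 2;        // +2 for a meta description
  const len = (doc.content || "").length;
  if (len >= 200 && len <= 10000) score += 3; // +3 for well-structured content
  return score;
}
```

A usage example: a page with a meta description and 500 chars of content gets `baseScore + 5`, while a near-empty page keeps its base score.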
---
Index Size:
The current index is ~53,500 pages, not 2 million. Only 56 pages actually contain "cheese" in the crawled content, so getting 45 results is accurate. The demo is smaller because:
1. The index is live and actively growing (currently ~100 docs/min crawl rate)
2. We're crawling real web content organically, not seeding from a dump
3. Distributed mesh crawlers are still ramping up
---
Long-term vision: to answer your question, the architecture currently supports both use cases:
- General web search: public crawling, distributed mesh nodes, browser-based PWA crawlers
- Private/enterprise search: SQLite-based, self-hostable, single-writer architecture
n1xis10t
9 hours ago
Also, when I try to click on the about pages on the website, it says {"error":"Authentication required"}
joeg_usa
3 hours ago
Yes, some of the pages require authentication; this may either be a bug, or they may in fact be intentionally secured.