joeg_usa
11 hours ago
I built a search engine that runs on Node + SQLite + FTS5.
- BM25 + 384-dim vector + FTS5 hybrid ranking
- Mesh network with RSA crypto identity (no central auth)
- Remote nodes contribute crawl data through P2P WebSocket
- 930 bytes per doc (2M docs = ~2GB)
- Currently indexing 52K+ domains
- Runs on 2 servers for $22/month
- Patent pending
Why: I wanted search infrastructure anyone could own and run. No Elasticsearch cluster. No cloud dependency. No vendor lock-in.

Demo: https://www.qwikwit.com

Stack: Node, JavaScript, SQLite, FTS5, WebSocket mesh

Happy to answer questions about the architecture.
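To make the hybrid ranking idea concrete, here is a minimal sketch of how an FTS5 BM25 rank could be blended with vector similarity. This is an illustrative assumption, not the actual qwikwit formula: the `alpha` weight and the normalization are made up, and `cosineSimilarity`/`hybridScore` are hypothetical helper names.

```javascript
// Hybrid ranking sketch: blend an FTS5 BM25 rank with cosine similarity
// between a query embedding and a document embedding (e.g. 384-dim).
// SQLite's bm25() returns lower-is-better values, so we negate it
// before mixing. The 0.7 lexical weight is an illustrative assumption.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// bm25Rank: raw value from SQLite's bm25() (lower = better match).
// queryVec / docVec: embedding vectors of equal length.
function hybridScore(bm25Rank, queryVec, docVec, alpha = 0.7) {
  const lexical = -bm25Rank; // flip sign so higher = better
  const semantic = cosineSimilarity(queryVec, docVec);
  return alpha * lexical + (1 - alpha) * semantic;
}
```

In practice the BM25 and cosine scores live on different scales, so a real implementation would normalize each per result set before mixing.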
n1xis10t
9 hours ago
Cool! Does ranking use term proximity of any kind, or anything similar to pagerank?
I only get 45 results for cheese. I should get far more results for cheese from an index of 2 million pages. How many pages are indexed? Is it actually 52K pages? Is the demo smaller on purpose?
What is your long term goal, to develop this into a general web search tool, or to have this be an Elasticsearch competitor that a bunch of people use for their own data?
Also since this is about search engines, I need to share this interesting article: https://archive.org/details/search-timeline
joeg_usa
3 hours ago
The index is building toward 2M. At present around 2.8 million entries are in the crawl pipeline, but only ~52k are indexed and searchable...
The long-term goal is to provide a local search solution for your own data that competes without the complexity, cost, and distribution overhead.
And yes, site search was the original feature set, which you aptly identified.
joeg_usa
3 hours ago
I suppose the answer here is that there are around 52k results, not by design but by limitation. We are actively crawling toward 2M with 10 local crawlers, 5 remote crawlers, and 0 mesh crawlers.
In general the target set is huge, but the sample indexed so far is small (and growing). My goal here was to showcase the site as an example.
Index Progress: 2.7% | 53,507 / 2,000,000 docs
Avg CPU / Memory: 2.5% / 956MB
Avg Doc Size: 1KB
Users (Active): 3 (3)
Active Sessions: 0
Local Crawlers: 11/11
Remote Crawlers: 4/5
Index Rate: 187/min
Total Documents: 53,507
Active Domains: 2,341
joeg_usa
3 hours ago
Ranking Algorithm:
The search uses SQLite FTS5 with BM25 ranking (Okapi BM25 probabilistic retrieval model), not PageRank. Current ranking factors:
- Field boosts: Title gets 20x weight, Tags 5x, Content 1x
- Quality boosts: +2 for meta descriptions, +3 for well-structured content (200-10k chars)
- No term proximity currently - FTS5 does boolean matching but not phrase distance scoring (though it is easy enough to do if necessary).
- No link graph/PageRank - we don't analyze inbound links between pages
Term proximity and link-based authority scoring (like our WitRank domain scoring system) are potential future enhancements. WitRank is built and produces scores, but they are not used in ranking yet.
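Those field and quality boosts can be sketched roughly as follows. This is a hypothetical helper, not the actual qwikwit code: in FTS5 itself the 20x/5x/1x field weights would normally be passed as column arguments to `bm25()`, and only the quality boosts would be applied afterward in application code.

```javascript
// Sketch of the boosts described above, applied on top of a base
// relevance score. In FTS5 the field weights would be supplied as
// column weights in the query, e.g.:
//   SELECT ..., bm25(docs, 20.0, 5.0, 1.0) AS rank FROM docs
//   WHERE docs MATCH ? ORDER BY rank
// (columns ordered title, tags, content).
function applyQualityBoosts(baseScore, doc) {
  let score = baseScore;
  if (doc.metaDescription) score += 2;        // +2 for a meta description
  const len = (doc.content || "").length;
  if (len >= 200 && len <= 10000) score += 3; // +3 for well-structured content
  return score;
}
```

A usage example: a page with a meta description and 500 chars of content gets `baseScore + 5`, while a near-empty page keeps its base score.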
---
Index Size:
The current index is ~53,500 pages, not 2 million. Only 56 pages actually contain "cheese" in the crawled content, so getting 45 results is accurate. The demo is smaller because:
1. The index is live and actively growing (currently ~100 docs/min crawl rate)
2. We're crawling real web content organically, not seeding from a dump
3. Distributed mesh crawlers are still ramping up
---
Long-term vision: to answer your question, the architecture currently supports both use cases:
- General web search: public crawling, distributed mesh nodes, browser-based PWA crawlers
- Private/enterprise search: SQLite-based, self-hostable, single-writer architecture
n1xis10t
9 hours ago
Also, when I try to click on the about pages on the website, it says {"error":"Authentication required"}
joeg_usa
3 hours ago
Yes, some of the pages require authentication; this may either be a bug, or they may in fact be intentionally secured.