skuznetsov37
8 hours ago
Author here. pg_sorted_heap is a PostgreSQL table AM extension that does two things:
1. Keeps data physically sorted by primary key with per-page zone maps. At 100M rows a point query reads 1 buffer page (0.045ms) vs 8 for btree (0.5ms) vs 520K for seq scan (1.2s).
2. Built-in IVF-PQ vector search — no pgvector dependency.
The vector search part is where it gets interesting. The physical clustering by PK prefix IS the inverted file index. You set partition_id (IVF cluster assignment) as the leading PK column, sorted_heap
clusters rows by it, and the zone map skips irrelevant partitions at the I/O level. No separate index structure, no 800 MB HNSW graph.
Numbers (103K × 2880-dim vectors, 1 Gi k8s pod):
pgvector HNSW: 97% R@1, 14ms, 806 MB index, max 2,000 dims
IVF-PQ: 97% R@1, 22ms, 27 MB index, max 32,000 dims
Two vector types: svec (float32, 16K dims) and hsvec (float16, 32K dims). We tested float16 vs float32 on the same dataset — no measurable recall difference. PQ quantization is the bottleneck, not storage
precision.
The 2,000-dim limit matters: models like Nomic embed v2 output 2880 dims, and pgvector simply can't build HNSW or IVFFlat indexes for them.
PostgreSQL 17 + 18, Apache 2.0. Full crash recovery, online compaction, TOAST, pg_upgrade all tested.
Vector search docs: https://skuznetsov.github.io/pg_sorted_heap/vector-search
Happy to answer questions about the architecture, benchmarks, or trade-offs vs pgvector.